* [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu
@ 2021-12-22 22:04 ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:04 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

This patchset is based on earlier work by Boris [1] that introduced an
ordered workqueue at the driver level, used by the different schedulers
to queue their timeout work. On top of that I also serialized any GPU
reset we trigger from within amdgpu code through the same ordered wq,
which somewhat simplifies our GPU reset code since we no longer need to
protect against concurrent GPU reset triggers such as TDR on one hand
and a sysfs or RAS trigger on the other.

As advised by Christian and Daniel, I defined a reset_domain struct so
that all the entities that go through reset together are serialized
against one another.

A TDR triggered by multiple entities within the same domain for the same
reason will only run once, since the first such reset cancels all the
pending ones. This applies only to TDR timers and not to triggered
resets coming from RAS or sysfs; those will still run after the
in-flight resets finish.
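
The serialization mechanism described above is simply an ordered
(max-active-one) workqueue per reset domain. A minimal sketch of the
pattern, using the stock kernel workqueue API (the names and call sites
here are illustrative, not taken verbatim from the patches):

```c
/*
 * One ordered wq per reset domain: at most one work item executes at a
 * time, in queueing order, so resets within a domain can never overlap.
 */
struct workqueue_struct *reset_wq = alloc_ordered_workqueue("reset-domain", 0);

static void do_reset(struct work_struct *w)
{
	/* actual GPU recovery runs here, strictly serialized */
}

/* Every trigger (TDR, sysfs, RAS) funnels through the same wq: */
struct work_struct w;

INIT_WORK(&w, do_reset);
queue_work(reset_wq, &w);  /* returns false if this work is already pending */
flush_work(&w);            /* block until our reset has executed */
```

In the series itself the per-device wq is created in
amdgpu_device_ip_init() and shared across a hive via
amdgpu_get_xgmi_hive() (patch 1), and the queue-and-flush wrapper
becomes amdgpu_device_gpu_recover() (patch 4).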

v2:
Add handling for the SRIOV configuration: the reset notification coming
from the host and the driver already triggers a work queue to handle the
reset, so drop this intermediate wq and send directly to the timeout
wq. (Shaoyun)

[1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/

P.S. Going through drm-misc-next and not amd-staging-drm-next since Boris'
work hasn't landed there yet.

Andrey Grodzovsky (8):
  drm/amdgpu: Introduce reset domain
  drm/amdgpu: Move scheduler init to after XGMI is ready
  drm/amdgpu: Fix crash on modprobe
  drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
  drm/amdgpu: Drop hive->in_reset
  drm/amdgpu: Drop concurrent GPU reset protection for device
  drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  18 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  18 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   7 +-
 10 files changed, 147 insertions(+), 164 deletions(-)

-- 
2.25.1



* [RFC v2 1/8] drm/amdgpu: Introduce reset domain
  2021-12-22 22:04 ` Andrey Grodzovsky
@ 2021-12-22 22:04   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:04 UTC (permalink / raw)
  To: dri-devel, amd-gfx
  Cc: Daniel Vetter, horace.chen, Christian König,
	christian.koenig, Monk.Liu

Defined a reset_domain struct such that
all the entities that go through reset
together are serialized against one
another. Do it for both the single device
and XGMI hive cases.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 +++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 ++
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9f017663ac50..b5ff76aae7e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -812,6 +812,11 @@ struct amd_powerplay {
 
 #define AMDGPU_RESET_MAGIC_NUM 64
 #define AMDGPU_MAX_DF_PERFMONS 4
+
+struct amdgpu_reset_domain {
+	struct workqueue_struct *wq;
+};
+
 struct amdgpu_device {
 	struct device			*dev;
 	struct pci_dev			*pdev;
@@ -1096,6 +1101,8 @@ struct amdgpu_device {
 
 	struct amdgpu_reset_control     *reset_cntl;
 	uint32_t                        ip_versions[HW_ID_MAX][HWIP_MAX_INSTANCE];
+
+	struct amdgpu_reset_domain	reset_domain;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 90d22a376632..0f3e6c078f88 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2391,9 +2391,27 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	if (r)
 		goto init_failed;
 
-	if (adev->gmc.xgmi.num_physical_nodes > 1)
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		struct amdgpu_hive_info *hive;
+
 		amdgpu_xgmi_add_device(adev);
 
+		hive = amdgpu_get_xgmi_hive(adev);
+		if (!hive || !hive->reset_domain.wq) {
+			DRM_ERROR("Failed to obtain reset domain info for XGMI hive\n");
+			r = -EINVAL;
+			goto init_failed;
+		}
+
+		adev->reset_domain.wq = hive->reset_domain.wq;
+	} else {
+		adev->reset_domain.wq = alloc_ordered_workqueue("amdgpu-reset-dev", 0);
+		if (!adev->reset_domain.wq) {
+			r = -ENOMEM;
+			goto init_failed;
+		}
+	}
+
 	/* Don't init kfd if whole hive need to be reset during init */
 	if (!adev->gmc.xgmi.pending_reset)
 		amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 567df2db23ac..a858e3457c5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -392,6 +392,14 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
 		goto pro_end;
 	}
 
+	hive->reset_domain.wq = alloc_ordered_workqueue("amdgpu-reset-hive", 0);
+	if (!hive->reset_domain.wq) {
+		dev_err(adev->dev, "XGMI: failed allocating wq for reset domain!\n");
+		kfree(hive);
+		hive = NULL;
+		goto pro_end;
+	}
+
 	hive->hive_id = adev->gmc.xgmi.hive_id;
 	INIT_LIST_HEAD(&hive->device_list);
 	INIT_LIST_HEAD(&hive->node);
@@ -401,6 +409,7 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
 	task_barrier_init(&hive->tb);
 	hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN;
 	hive->hi_req_gpu = NULL;
+
 	/*
 	 * hive pstate on boot is high in vega20 so we have to go to low
 	 * pstate on after boot.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index d2189bf7d428..6121aaa292cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -42,6 +42,8 @@ struct amdgpu_hive_info {
 		AMDGPU_XGMI_PSTATE_MAX_VEGA20,
 		AMDGPU_XGMI_PSTATE_UNKNOWN
 	} pstate;
+
+	struct amdgpu_reset_domain reset_domain;
 };
 
 struct amdgpu_pcs_ras_field {
-- 
2.25.1



* [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready
  2021-12-22 22:04 ` Andrey Grodzovsky
@ 2021-12-22 22:05   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:05 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Before we initialize the schedulers we must know
which reset domain we are in - for a single device
there is a single domain per device and so a single
wq per device. For XGMI the reset domain spans the
entire XGMI hive and so the reset wq is per hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
 3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0f3e6c078f88..7c063fd37389 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2284,6 +2284,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device *adev)
 	return r;
 }
 
+static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+	long timeout;
+	int r, i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		/* No need to setup the GPU scheduler for rings that don't need it */
+		if (!ring || ring->no_scheduler)
+			continue;
+
+		switch (ring->funcs->type) {
+		case AMDGPU_RING_TYPE_GFX:
+			timeout = adev->gfx_timeout;
+			break;
+		case AMDGPU_RING_TYPE_COMPUTE:
+			timeout = adev->compute_timeout;
+			break;
+		case AMDGPU_RING_TYPE_SDMA:
+			timeout = adev->sdma_timeout;
+			break;
+		default:
+			timeout = adev->video_timeout;
+			break;
+		}
+
+		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+				   ring->num_hw_submission, amdgpu_job_hang_limit,
+				   timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
+		if (r) {
+			DRM_ERROR("Failed to create scheduler on ring %s.\n",
+				  ring->name);
+			return r;
+		}
+	}
+
+	return 0;
+}
+
+
 /**
  * amdgpu_device_ip_init - run init for hardware IPs
  *
@@ -2412,6 +2453,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 		}
 	}
 
+	r = amdgpu_device_init_schedulers(adev);
+	if (r)
+		goto init_failed;
+
 	/* Don't init kfd if whole hive need to be reset during init */
 	if (!adev->gmc.xgmi.pending_reset)
 		amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 3b7e86ea7167..5527c68c51de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -456,8 +456,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 				  atomic_t *sched_score)
 {
 	struct amdgpu_device *adev = ring->adev;
-	long timeout;
-	int r;
 
 	if (!adev)
 		return -EINVAL;
@@ -477,36 +475,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 	spin_lock_init(&ring->fence_drv.lock);
 	ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
 					 GFP_KERNEL);
-	if (!ring->fence_drv.fences)
-		return -ENOMEM;
 
-	/* No need to setup the GPU scheduler for rings that don't need it */
-	if (ring->no_scheduler)
-		return 0;
+	ring->num_hw_submission = num_hw_submission;
+	ring->sched_score = sched_score;
 
-	switch (ring->funcs->type) {
-	case AMDGPU_RING_TYPE_GFX:
-		timeout = adev->gfx_timeout;
-		break;
-	case AMDGPU_RING_TYPE_COMPUTE:
-		timeout = adev->compute_timeout;
-		break;
-	case AMDGPU_RING_TYPE_SDMA:
-		timeout = adev->sdma_timeout;
-		break;
-	default:
-		timeout = adev->video_timeout;
-		break;
-	}
-
-	r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
-			   num_hw_submission, amdgpu_job_hang_limit,
-			   timeout, NULL, sched_score, ring->name);
-	if (r) {
-		DRM_ERROR("Failed to create scheduler on ring %s.\n",
-			  ring->name);
-		return r;
-	}
+	if (!ring->fence_drv.fences)
+		return -ENOMEM;
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 4d380e79752c..a4b8279e3011 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -253,6 +253,8 @@ struct amdgpu_ring {
 	bool			has_compute_vm_bug;
 	bool			no_scheduler;
 	int			hw_prio;
+	unsigned 		num_hw_submission;
+	atomic_t		*sched_score;
 };
 
 #define amdgpu_ring_parse_cs(r, p, ib) ((r)->funcs->parse_cs((p), (ib)))
-- 
2.25.1



* [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe
  2021-12-22 22:04 ` Andrey Grodzovsky
@ 2021-12-22 22:05   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:05 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Restrict job resubmission to the suspend case
only, since the schedulers are not initialised
yet on probe.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 5527c68c51de..8ebd954e06c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler) {
+		if (adev->in_suspend && !ring->no_scheduler) {
 			drm_sched_resubmit_jobs(&ring->sched);
 			drm_sched_start(&ring->sched, true);
 		}
-- 
2.25.1



* [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2021-12-22 22:04 ` Andrey Grodzovsky
@ 2021-12-22 22:05   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:05 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Use the reset domain wq also for non-TDR GPU recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it already executes from within the wq. For the others just
use a wrapper to queue the work and wait on it to finish.

v2: Rename to amdgpu_recover_work_struct

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b5ff76aae7e0..8e96b9a14452 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
 bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job* job);
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+			      struct amdgpu_job *job);
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
 bool amdgpu_device_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7c063fd37389..258ec3c0b2af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
  * Returns 0 for success or an error on failure.
  */
 
-int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 			      struct amdgpu_job *job)
 {
 	struct list_head device_list, *device_list_handle =  NULL;
@@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	return r;
 }
 
+struct amdgpu_recover_work_struct {
+	struct work_struct base;
+	struct amdgpu_device *adev;
+	struct amdgpu_job *job;
+	int ret;
+};
+
+static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
+{
+	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
+
+	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+}
+/*
+ * Serialize gpu recover into reset domain single threaded wq
+ */
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+				    struct amdgpu_job *job)
+{
+	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
+
+	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+
+	if (!queue_work(adev->reset_domain.wq, &work.base))
+		return -EAGAIN;
+
+	flush_work(&work.base);
+
+	return work.ret;
+}
+
 /**
  * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
  *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index bfc47bea23db..38c9fd7b7ad4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
+		amdgpu_device_gpu_recover_imp(ring->adev, job);
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
  2021-12-22 22:04 ` Andrey Grodzovsky
@ 2021-12-22 22:13   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:13 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: horace.chen, christian.koenig, Monk.Liu, Liu Shaoyun

No need to trigger another work queue from inside the work queue.

Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 +++++--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 +++++--
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 +++++--
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 23b066bcffb2..487cd654b69e 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -276,7 +276,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
 		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -302,7 +302,10 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 	switch (event) {
 		case IDH_FLR_NOTIFICATION:
 		if (amdgpu_sriov_runtime(adev))
-			schedule_work(&adev->virt.flr_work);
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 		break;
 		case IDH_QUERY_ALIVE:
 			xgpu_ai_mailbox_send_ack(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index a35e6d87e537..e3869067a31d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -308,7 +308,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -337,7 +337,10 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
 	switch (event) {
 	case IDH_FLR_NOTIFICATION:
 		if (amdgpu_sriov_runtime(adev))
-			schedule_work(&adev->virt.flr_work);
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 		break;
 		/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can ignore
 		 * it byfar since that polling thread will handle it,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index aef9d059ae52..23e802cae2bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -551,7 +551,10 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
 
 		/* only handle FLR_NOTIFY now */
 		if (!r)
-			schedule_work(&adev->virt.flr_work);
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 	}
 
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset
  2021-12-22 22:13   ` Andrey Grodzovsky
@ 2021-12-22 22:13     ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:13 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Since we serialize all resets there is no need to protect
against concurrent resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +------------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 -
 3 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 258ec3c0b2af..107a393ebbfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5013,25 +5013,9 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GPU %s begin!\n",
 		need_emergency_restart ? "jobs stop":"reset");
 
-	/*
-	 * Here we trylock to avoid chain of resets executing from
-	 * either trigger by jobs on different adevs in XGMI hive or jobs on
-	 * different schedulers for same device while this TO handler is running.
-	 * We always reset all schedulers for device and all devices for XGMI
-	 * hive so that should take care of them too.
-	 */
 	hive = amdgpu_get_xgmi_hive(adev);
-	if (hive) {
-		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
-			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
-				job ? job->base.id : -1, hive->hive_id);
-			amdgpu_put_xgmi_hive(hive);
-			if (job && job->vm)
-				drm_sched_increase_karma(&job->base);
-			return 0;
-		}
+	if (hive)
 		mutex_lock(&hive->hive_lock);
-	}
 
 	reset_context.method = AMD_RESET_METHOD_NONE;
 	reset_context.reset_req_dev = adev;
@@ -5227,7 +5211,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 
 skip_recovery:
 	if (hive) {
-		atomic_set(&hive->in_reset, 0);
 		mutex_unlock(&hive->hive_lock);
 		amdgpu_put_xgmi_hive(hive);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index a858e3457c5c..9ad742039ac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -404,7 +404,6 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
 	INIT_LIST_HEAD(&hive->device_list);
 	INIT_LIST_HEAD(&hive->node);
 	mutex_init(&hive->hive_lock);
-	atomic_set(&hive->in_reset, 0);
 	atomic_set(&hive->number_devices, 0);
 	task_barrier_init(&hive->tb);
 	hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 6121aaa292cb..2f2ce53645a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -33,7 +33,6 @@ struct amdgpu_hive_info {
 	struct list_head node;
 	atomic_t number_devices;
 	struct mutex hive_lock;
-	atomic_t in_reset;
 	int hi_req_count;
 	struct amdgpu_device *hi_req_gpu;
 	struct task_barrier tb;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device
  2021-12-22 22:13   ` Andrey Grodzovsky
@ 2021-12-22 22:13     ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:13 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Since now all GPU resets are serialized there is no need for this.

This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++--------------------
 1 file changed, 7 insertions(+), 82 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 107a393ebbfd..fef952ca8db5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4763,11 +4763,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 	return r;
 }
 
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev,
+static void amdgpu_device_lock_adev(struct amdgpu_device *adev,
 				struct amdgpu_hive_info *hive)
 {
-	if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
-		return false;
+	atomic_set(&adev->in_gpu_reset, 1);
 
 	if (hive) {
 		down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
@@ -4786,8 +4785,6 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device *adev,
 		adev->mp1_state = PP_MP1_STATE_NONE;
 		break;
 	}
-
-	return true;
 }
 
 static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
@@ -4798,46 +4795,6 @@ static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
 	up_write(&adev->reset_sem);
 }
 
-/*
- * to lockup a list of amdgpu devices in a hive safely, if not a hive
- * with multiple nodes, it will be similar as amdgpu_device_lock_adev.
- *
- * unlock won't require roll back.
- */
-static int amdgpu_device_lock_hive_adev(struct amdgpu_device *adev, struct amdgpu_hive_info *hive)
-{
-	struct amdgpu_device *tmp_adev = NULL;
-
-	if (adev->gmc.xgmi.num_physical_nodes > 1) {
-		if (!hive) {
-			dev_err(adev->dev, "Hive is NULL while device has multiple xgmi nodes");
-			return -ENODEV;
-		}
-		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
-			if (!amdgpu_device_lock_adev(tmp_adev, hive))
-				goto roll_back;
-		}
-	} else if (!amdgpu_device_lock_adev(adev, hive))
-		return -EAGAIN;
-
-	return 0;
-roll_back:
-	if (!list_is_first(&tmp_adev->gmc.xgmi.head, &hive->device_list)) {
-		/*
-		 * if the lockup iteration break in the middle of a hive,
-		 * it may means there may has a race issue,
-		 * or a hive device locked up independently.
-		 * we may be in trouble and may not, so will try to roll back
-		 * the lock and give out a warnning.
-		 */
-		dev_warn(tmp_adev->dev, "Hive lock iteration broke in the middle. Rolling back to unlock");
-		list_for_each_entry_continue_reverse(tmp_adev, &hive->device_list, gmc.xgmi.head) {
-			amdgpu_device_unlock_adev(tmp_adev);
-		}
-	}
-	return -EAGAIN;
-}
-
 static void amdgpu_device_resume_display_audio(struct amdgpu_device *adev)
 {
 	struct pci_dev *p = NULL;
@@ -5023,22 +4980,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 	reset_context.hive = hive;
 	clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
 
-	/*
-	 * lock the device before we try to operate the linked list
-	 * if didn't get the device lock, don't touch the linked list since
-	 * others may iterating it.
-	 */
-	r = amdgpu_device_lock_hive_adev(adev, hive);
-	if (r) {
-		dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-					job ? job->base.id : -1);
-
-		/* even we skipped this reset, still need to set the job to guilty */
-		if (job && job->vm)
-			drm_sched_increase_karma(&job->base);
-		goto skip_recovery;
-	}
-
 	/*
 	 * Build list of devices to reset.
 	 * In case we are in XGMI hive mode, resort the device list
@@ -5058,6 +4999,9 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
+
+		amdgpu_device_lock_adev(tmp_adev, hive);
+
 		/*
 		 * Try to put the audio codec into suspend state
 		 * before gpu reset started.
@@ -5209,13 +5153,12 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 		amdgpu_device_unlock_adev(tmp_adev);
 	}
 
-skip_recovery:
 	if (hive) {
 		mutex_unlock(&hive->hive_lock);
 		amdgpu_put_xgmi_hive(hive);
 	}
 
-	if (r && r != -EAGAIN)
+	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 	return r;
 }
@@ -5438,20 +5381,6 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
 	return 0;
 }
 
-static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
-{
-	int i;
-
-	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-		struct amdgpu_ring *ring = adev->rings[i];
-
-		if (!ring || !ring->sched.thread)
-			continue;
-
-		cancel_delayed_work_sync(&ring->sched.work_tdr);
-	}
-}
-
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -5482,14 +5411,10 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 	/* Fatal error, prepare for slot reset */
 	case pci_channel_io_frozen:
 		/*
-		 * Cancel and wait for all TDRs in progress if failing to
-		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
-		 *
 		 * Locking adev->reset_sem will prevent any external access
 		 * to GPU during PCI error recovery
 		 */
-		while (!amdgpu_device_lock_adev(adev, NULL))
-			amdgpu_cancel_all_tdr(adev);
+		amdgpu_device_lock_adev(adev, NULL);
 
 		/*
 		 * Block any work scheduling as we do for regular GPU reset
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-22 22:13   ` Andrey Grodzovsky
@ 2021-12-22 22:14     ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-22 22:14 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: Monk.Liu, horace.chen, christian.koenig

Since now flr work is serialized against GPU resets
there is no need for this.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread


* Re: [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready
  2021-12-22 22:05   ` Andrey Grodzovsky
@ 2021-12-23  8:39     ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2021-12-23  8:39 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:
> Before we initialize schedulers we must know which reset
> domain we are in - for a single device there is a single
> domain per device and so a single wq per device. For XGMI
> the reset domain spans the entire XGMI hive and so the
> reset wq is per hive.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>   3 files changed, 51 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0f3e6c078f88..7c063fd37389 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2284,6 +2284,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device *adev)
>   	return r;
>   }
>   
> +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
> +{
> +	long timeout;
> +	int r, i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		/* No need to setup the GPU scheduler for rings that don't need it */
> +		if (!ring || ring->no_scheduler)
> +			continue;
> +
> +		switch (ring->funcs->type) {
> +		case AMDGPU_RING_TYPE_GFX:
> +			timeout = adev->gfx_timeout;
> +			break;
> +		case AMDGPU_RING_TYPE_COMPUTE:
> +			timeout = adev->compute_timeout;
> +			break;
> +		case AMDGPU_RING_TYPE_SDMA:
> +			timeout = adev->sdma_timeout;
> +			break;
> +		default:
> +			timeout = adev->video_timeout;
> +			break;
> +		}
> +
> +		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
> +				   ring->num_hw_submission, amdgpu_job_hang_limit,
> +				   timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
> +		if (r) {
> +			DRM_ERROR("Failed to create scheduler on ring %s.\n",
> +				  ring->name);
> +			return r;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +
>   /**
>    * amdgpu_device_ip_init - run init for hardware IPs
>    *
> @@ -2412,6 +2453,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>   		}
>   	}
>   
> +	r = amdgpu_device_init_schedulers(adev);
> +	if (r)
> +		goto init_failed;
> +
>   	/* Don't init kfd if whole hive need to be reset during init */
>   	if (!adev->gmc.xgmi.pending_reset)
>   		amdgpu_amdkfd_device_init(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 3b7e86ea7167..5527c68c51de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -456,8 +456,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>   				  atomic_t *sched_score)
>   {
>   	struct amdgpu_device *adev = ring->adev;
> -	long timeout;
> -	int r;
>   
>   	if (!adev)
>   		return -EINVAL;
> @@ -477,36 +475,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>   	spin_lock_init(&ring->fence_drv.lock);
>   	ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
>   					 GFP_KERNEL);
> -	if (!ring->fence_drv.fences)
> -		return -ENOMEM;
>   
> -	/* No need to setup the GPU scheduler for rings that don't need it */
> -	if (ring->no_scheduler)
> -		return 0;
> +	ring->num_hw_submission = num_hw_submission;
> +	ring->sched_score = sched_score;
>   
> -	switch (ring->funcs->type) {
> -	case AMDGPU_RING_TYPE_GFX:
> -		timeout = adev->gfx_timeout;
> -		break;
> -	case AMDGPU_RING_TYPE_COMPUTE:
> -		timeout = adev->compute_timeout;
> -		break;
> -	case AMDGPU_RING_TYPE_SDMA:
> -		timeout = adev->sdma_timeout;
> -		break;
> -	default:
> -		timeout = adev->video_timeout;
> -		break;
> -	}
> -
> -	r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
> -			   num_hw_submission, amdgpu_job_hang_limit,
> -			   timeout, NULL, sched_score, ring->name);
> -	if (r) {
> -		DRM_ERROR("Failed to create scheduler on ring %s.\n",
> -			  ring->name);
> -		return r;
> -	}
> +	if (!ring->fence_drv.fences)
> +		return -ENOMEM;
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 4d380e79752c..a4b8279e3011 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -253,6 +253,8 @@ struct amdgpu_ring {
>   	bool			has_compute_vm_bug;
>   	bool			no_scheduler;
>   	int			hw_prio;
> +	unsigned 		num_hw_submission;
> +	atomic_t		*sched_score;
>   };
>   
>   #define amdgpu_ring_parse_cs(r, p, ib) ((r)->funcs->parse_cs((p), (ib)))




* Re: [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe
  2021-12-22 22:05   ` Andrey Grodzovsky
@ 2021-12-23  8:40     ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2021-12-23  8:40 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu



Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:
> Restrict job resubmission to the suspend case
> only, since the schedulers are not initialized
> yet on probe.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 5527c68c51de..8ebd954e06c6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>   		if (!ring || !ring->fence_drv.initialized)
>   			continue;
>   
> -		if (!ring->no_scheduler) {
> +		if (adev->in_suspend && !ring->no_scheduler) {

Please add a TODO comment, something like "restructure resume to make 
that unnecessary".

With that done the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com> as well.

Christian.

>   			drm_sched_resubmit_jobs(&ring->sched);
>   			drm_sched_start(&ring->sched, true);
>   		}




* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2021-12-22 22:05   ` Andrey Grodzovsky
@ 2021-12-23  8:41     ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2021-12-23  8:41 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:
> Use the reset domain wq also for non-TDR GPU recovery triggers
> such as sysfs and RAS. We must serialize all possible
> GPU recoveries to guarantee no concurrency there.
> For TDR, call the original recovery function directly since
> it's already executed from within the wq. For others, just
> use a wrapper to queue the work and wait on it to finish.
>
> v2: Rename to amdgpu_recover_work_struct
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>   3 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index b5ff76aae7e0..8e96b9a14452 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			      struct amdgpu_job* job);
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> +			      struct amdgpu_job *job);
>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7c063fd37389..258ec3c0b2af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>    * Returns 0 for success or an error on failure.
>    */
>   
> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   			      struct amdgpu_job *job)
>   {
>   	struct list_head device_list, *device_list_handle =  NULL;
> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +struct amdgpu_recover_work_struct {
> +	struct work_struct base;
> +	struct amdgpu_device *adev;
> +	struct amdgpu_job *job;
> +	int ret;
> +};
> +
> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> +{
> +	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> +
> +	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +}
> +/*
> + * Serialize gpu recover into reset domain single threaded wq
> + */
> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> +				    struct amdgpu_job *job)
> +{
> +	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
> +
> +	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> +
> +	if (!queue_work(adev->reset_domain.wq, &work.base))
> +		return -EAGAIN;
> +
> +	flush_work(&work.base);
> +
> +	return work.ret;
> +}
> +
>   /**
>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>    *
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index bfc47bea23db..38c9fd7b7ad4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> -		amdgpu_device_gpu_recover(ring->adev, job);
> +		amdgpu_device_gpu_recover_imp(ring->adev, job);
>   	} else {
>   		drm_sched_suspend_timeout(&ring->sched);
>   		if (amdgpu_sriov_vf(adev))




* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-22 22:14     ` Andrey Grodzovsky
@ 2021-12-23  8:42       ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2021-12-23  8:42 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
> Now that FLR work is serialized against GPU resets
> there is no need for this.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>   2 files changed, 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 487cd654b69e..7d59a66e3988 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (!down_write_trylock(&adev->reset_sem))
> -		return;
> -
>   	amdgpu_virt_fini_data_exchange(adev);
> -	atomic_set(&adev->in_gpu_reset, 1);
>   
>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	} while (timeout > 1);
>   
>   flr_done:
> -	atomic_set(&adev->in_gpu_reset, 0);
> -	up_write(&adev->reset_sem);
> -
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
>   		&& (!amdgpu_device_has_job_running(adev) ||
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index e3869067a31d..f82c066c8e8d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (!down_write_trylock(&adev->reset_sem))
> -		return;
> -
>   	amdgpu_virt_fini_data_exchange(adev);
> -	atomic_set(&adev->in_gpu_reset, 1);
>   
>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   	} while (timeout > 1);
>   
>   flr_done:
> -	atomic_set(&adev->in_gpu_reset, 0);
> -	up_write(&adev->reset_sem);
> -
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
>   		&& (!amdgpu_device_has_job_running(adev) ||




* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-23  8:42       ` Christian König
@ 2021-12-23 10:14         ` Liu, Monk
  -1 siblings, 0 replies; 103+ messages in thread
From: Liu, Monk @ 2021-12-23 10:14 UTC (permalink / raw)
  To: Koenig, Christian, Grodzovsky, Andrey, dri-devel, amd-gfx, Chen,
	Horace, Chen,  JingWen, Deng, Emily

[AMD Official Use Only]

@Chen, Horace @Chen, JingWen @Deng, Emily

Please take a review on Andrey's patch 

Thanks 
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Solution | AMD
-------------------------------------------------------------------
we are hiring software manager for CVS core team
-------------------------------------------------------------------

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com> 
Sent: Thursday, December 23, 2021 4:42 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace <Horace.Chen@amd.com>
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
> Now that FLR work is serialized against GPU resets there is no need
> for this.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>   2 files changed, 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 487cd654b69e..7d59a66e3988 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (!down_write_trylock(&adev->reset_sem))
> -		return;
> -
>   	amdgpu_virt_fini_data_exchange(adev);
> -	atomic_set(&adev->in_gpu_reset, 1);
>   
>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	} while (timeout > 1);
>   
>   flr_done:
> -	atomic_set(&adev->in_gpu_reset, 0);
> -	up_write(&adev->reset_sem);
> -
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
>   		&& (!amdgpu_device_has_job_running(adev) || diff --git 
> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index e3869067a31d..f82c066c8e8d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (!down_write_trylock(&adev->reset_sem))
> -		return;
> -
>   	amdgpu_virt_fini_data_exchange(adev);
> -	atomic_set(&adev->in_gpu_reset, 1);
>   
>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   	} while (timeout > 1);
>   
>   flr_done:
> -	atomic_set(&adev->in_gpu_reset, 0);
> -	up_write(&adev->reset_sem);
> -
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
>   		&& (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-22 22:14     ` Andrey Grodzovsky
@ 2021-12-23 18:07       ` Liu, Shaoyun
  -1 siblings, 0 replies; 103+ messages in thread
From: Liu, Shaoyun @ 2021-12-23 18:07 UTC (permalink / raw)
  To: Grodzovsky, Andrey, dri-devel, amd-gfx
  Cc: Chen, Horace, Koenig, Christian, Liu, Monk

[AMD Official Use Only]

I had a discussion with Andrey about this offline. It seems dangerous to remove the in_gpu_reset and reset_sem handling directly inside flr_work. In the case when the reset is triggered from the host side, the GPU needs to stay locked while the host performs the reset, after flr_work replies to the host with READY_TO_RESET.
The original comments seem to need updating.

Regards
Shaoyun.liu
 

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrey Grodzovsky
Sent: Wednesday, December 22, 2021 5:14 PM
To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; daniel@ffwll.ch
Subject: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Since now flr work is serialized against  GPU resets there is no need for this.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) || diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
--
2.25.1

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
  2021-12-22 22:13   ` Andrey Grodzovsky
@ 2021-12-23 18:29     ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-23 18:29 UTC (permalink / raw)
  To: dri-devel, amd-gfx
  Cc: Andrey Grodzovsky, horace.chen, daniel, christian.koenig,
	Monk.Liu, Liu Shaoyun

No need to trigger another work queue from inside the work queue.

v3:

Problem:
An extra reset is caused by a host side FLR notification
following a guest side triggered reset.
Fix: Prevent queuing flr_work from the mailbox irq if the guest
is already executing a reset.

Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 9 ++++++---
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 9 ++++++---
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 9 ++++++---
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 23b066bcffb2..bdeb8e933bb4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -276,7 +276,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
 		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -301,8 +301,11 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 
 	switch (event) {
 		case IDH_FLR_NOTIFICATION:
-		if (amdgpu_sriov_runtime(adev))
-			schedule_work(&adev->virt.flr_work);
+		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 		break;
 		case IDH_QUERY_ALIVE:
 			xgpu_ai_mailbox_send_ack(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index a35e6d87e537..dd8dc0f6028c 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -308,7 +308,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -336,8 +336,11 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
 
 	switch (event) {
 	case IDH_FLR_NOTIFICATION:
-		if (amdgpu_sriov_runtime(adev))
-			schedule_work(&adev->virt.flr_work);
+		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 		break;
 		/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can ignore
 		 * it byfar since that polling thread will handle it,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index aef9d059ae52..c2afb72f97ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -550,8 +550,11 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
 		r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
 
 		/* only handle FLR_NOTIFY now */
-		if (!r)
-			schedule_work(&adev->virt.flr_work);
+		if (!r && !amdgpu_in_reset(adev))
+			WARN_ONCE(!queue_work(adev->reset_domain.wq,
+					      &adev->virt.flr_work),
+				  "Failed to queue work! at %s",
+				  __FUNCTION__ );
 	}
 
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-23 10:14         ` Liu, Monk
@ 2021-12-24  8:58           ` Deng, Emily
  -1 siblings, 0 replies; 103+ messages in thread
From: Deng, Emily @ 2021-12-24  8:58 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Grodzovsky, Andrey, dri-devel,
	amd-gfx, Chen, Horace, Chen, JingWen

These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.

Best wishes
Emily Deng



>-----Original Message-----
>From: Liu, Monk <Monk.Liu@amd.com>
>Sent: Thursday, December 23, 2021 6:14 PM
>To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
><Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>Cc: daniel@ffwll.ch
>Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>for SRIOV
>
>[AMD Official Use Only]
>
>@Chen, Horace @Chen, JingWen @Deng, Emily
>
>Please take a review on Andrey's patch
>
>Thanks
>-------------------------------------------------------------------
>Monk Liu | Cloud GPU & Virtualization Solution | AMD
>-------------------------------------------------------------------
>we are hiring software manager for CVS core team
>-------------------------------------------------------------------
>
>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@amd.com>
>Sent: Thursday, December 23, 2021 4:42 PM
>To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
><Horace.Chen@amd.com>
>Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>for SRIOV
>
>Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>> Since now flr work is serialized against  GPU resets there is no need
>> for this.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
>Acked-by: Christian König <christian.koenig@amd.com>
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>   2 files changed, 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> index 487cd654b69e..7d59a66e3988 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
>>   	struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>
>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>> -	 * otherwise the mailbox msg will be ruined/reseted by
>> -	 * the VF FLR.
>> -	 */
>> -	if (!down_write_trylock(&adev->reset_sem))
>> -		return;
>> -
>>   	amdgpu_virt_fini_data_exchange(adev);
>> -	atomic_set(&adev->in_gpu_reset, 1);
>>
>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>
>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
>>   	} while (timeout > 1);
>>
>>   flr_done:
>> -	atomic_set(&adev->in_gpu_reset, 0);
>> -	up_write(&adev->reset_sem);
>> -
>>   	/* Trigger recovery for world switch failure if no TDR */
>>   	if (amdgpu_device_should_recover_gpu(adev)
>>   		&& (!amdgpu_device_has_job_running(adev) || diff --git
>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> index e3869067a31d..f82c066c8e8d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
>>   	struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>
>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>> -	 * otherwise the mailbox msg will be ruined/reseted by
>> -	 * the VF FLR.
>> -	 */
>> -	if (!down_write_trylock(&adev->reset_sem))
>> -		return;
>> -
>>   	amdgpu_virt_fini_data_exchange(adev);
>> -	atomic_set(&adev->in_gpu_reset, 1);
>>
>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>
>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
>>   	} while (timeout > 1);
>>
>>   flr_done:
>> -	atomic_set(&adev->in_gpu_reset, 0);
>> -	up_write(&adev->reset_sem);
>> -
>>   	/* Trigger recovery for world switch failure if no TDR */
>>   	if (amdgpu_device_should_recover_gpu(adev)
>>   		&& (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-24  8:58           ` Deng, Emily
@ 2021-12-24  9:57             ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2021-12-24  9:57 UTC (permalink / raw)
  To: Deng, Emily, Liu, Monk, Koenig, Christian, Grodzovsky, Andrey,
	dri-devel, amd-gfx, Chen, Horace, Chen,  JingWen

I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, a guest-side thread may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem in many places to identify the reset status). And this may lead to very bad results.

On 2021/12/24 4:58 PM, Deng, Emily wrote:
> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Liu, Monk <Monk.Liu@amd.com>
>> Sent: Thursday, December 23, 2021 6:14 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> [AMD Official Use Only]
>>
>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>
>> Please take a review on Andrey's patch
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, December 23, 2021 4:42 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>> <Horace.Chen@amd.com>
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>> Since now flr work is serialized against  GPU resets there is no need
>>> for this.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>   2 files changed, 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> index 487cd654b69e..7d59a66e3988 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) || diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> index e3869067a31d..f82c066c8e8d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-24  8:58           ` Deng, Emily
@ 2021-12-30 18:39             ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-30 18:39 UTC (permalink / raw)
  To: Deng, Emily, Liu, Monk, Koenig, Christian, dri-devel, amd-gfx,
	Chen, Horace, Chen, JingWen

Thanks a lot, please let me know.

Andrey

On 2021-12-24 3:58 a.m., Deng, Emily wrote:
> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Liu, Monk <Monk.Liu@amd.com>
>> Sent: Thursday, December 23, 2021 6:14 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> [AMD Official Use Only]
>>
>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>
>> Please take a review on Andrey's patch
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, December 23, 2021 4:42 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>> <Horace.Chen@amd.com>
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>> Since now flr work is serialized against  GPU resets there is no need
>>> for this.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>    2 files changed, 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> index 487cd654b69e..7d59a66e3988 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>    	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>    	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>    	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>    	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>    	} while (timeout > 1);
>>>
>>>    flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>    	/* Trigger recovery for world switch failure if no TDR */
>>>    	if (amdgpu_device_should_recover_gpu(adev)
>>>    		&& (!amdgpu_device_has_job_running(adev) || diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> index e3869067a31d..f82c066c8e8d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>    	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>    	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>    	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>    	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>    	} while (timeout > 1);
>>>
>>>    flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>    	/* Trigger recovery for world switch failure if no TDR */
>>>    	if (amdgpu_device_should_recover_gpu(adev)
>>>    		&& (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-24  9:57             ` JingWen Chen
@ 2021-12-30 18:45               ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2021-12-30 18:45 UTC (permalink / raw)
  To: JingWen Chen, Deng, Emily, Liu, Monk, Koenig, Christian,
	dri-devel, amd-gfx, Chen, Horace, Chen, JingWen

Sure, I guess I can drop this patch then.

Andrey

On 2021-12-24 4:57 a.m., JingWen Chen wrote:
> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>
> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Liu, Monk <Monk.Liu@amd.com>
>>> Sent: Thursday, December 23, 2021 6:14 PM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>> Cc: daniel@ffwll.ch
>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>> for SRIOV
>>>
>>> [AMD Official Use Only]
>>>
>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>
>>> Please take a review on Andrey's patch
>>>
>>> Thanks
>>> -------------------------------------------------------------------
>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>> -------------------------------------------------------------------
>>> we are hiring software manager for CVS core team
>>> -------------------------------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Thursday, December 23, 2021 4:42 PM
>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>> <Horace.Chen@amd.com>
>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>> for SRIOV
>>>
>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>> Since now flr work is serialized against  GPU resets there is no need
>>>> for this.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>    2 files changed, 22 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>> index 487cd654b69e..7d59a66e3988 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>> work_struct *work)
>>>>    	struct amdgpu_device *adev = container_of(virt, struct
>>> amdgpu_device, virt);
>>>>    	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>
>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>> -	 * the VF FLR.
>>>> -	 */
>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>> -		return;
>>>> -
>>>>    	amdgpu_virt_fini_data_exchange(adev);
>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>
>>>>    	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>
>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>> work_struct *work)
>>>>    	} while (timeout > 1);
>>>>
>>>>    flr_done:
>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>> -	up_write(&adev->reset_sem);
>>>> -
>>>>    	/* Trigger recovery for world switch failure if no TDR */
>>>>    	if (amdgpu_device_should_recover_gpu(adev)
>>>>    		&& (!amdgpu_device_has_job_running(adev) || diff --git
>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>> index e3869067a31d..f82c066c8e8d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>> work_struct *work)
>>>>    	struct amdgpu_device *adev = container_of(virt, struct
>>> amdgpu_device, virt);
>>>>    	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>
>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>> -	 * the VF FLR.
>>>> -	 */
>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>> -		return;
>>>> -
>>>>    	amdgpu_virt_fini_data_exchange(adev);
>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>
>>>>    	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>
>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>> work_struct *work)
>>>>    	} while (timeout > 1);
>>>>
>>>>    flr_done:
>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>> -	up_write(&adev->reset_sem);
>>>> -
>>>>    	/* Trigger recovery for world switch failure if no TDR */
>>>>    	if (amdgpu_device_should_recover_gpu(adev)
>>>>    		&& (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2021-12-30 18:45               ` Andrey Grodzovsky
@ 2022-01-03 10:17                 ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-03 10:17 UTC (permalink / raw)
  To: Andrey Grodzovsky, JingWen Chen, Deng, Emily, Liu, Monk, Koenig,
	Christian, dri-devel, amd-gfx, Chen, Horace, Chen, JingWen

Please don't. This patch is vital to the cleanup of the reset procedure.

If SRIOV doesn't work with that, we need to change SRIOV, not the driver.

Christian.

On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
> Sure, I guess i can drop this patch then.
>
> Andrey
>
> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>> I do agree with shaoyun, if the host find the gpu engine hangs first, 
>> and do the flr, guest side thread may not know this and still try to 
>> access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to 
>> identify the reset status). And this may lead to very bad result.
>>
>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>> These patches look good to me. JingWen will pull these patches and 
>>> do some basic TDR test on sriov environment, and give feedback.
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>>>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>> Cc: daniel@ffwll.ch
>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>> protection
>>>> for SRIOV
>>>>
>>>> [AMD Official Use Only]
>>>>
>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>
>>>> Please take a review on Andrey's patch
>>>>
>>>> Thanks
>>>> -------------------------------------------------------------------
>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>> -------------------------------------------------------------------
>>>> we are hiring software manager for CVS core team
>>>> -------------------------------------------------------------------
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>> <Horace.Chen@amd.com>
>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>> protection
>>>> for SRIOV
>>>>
>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>> Since now flr work is serialized against  GPU resets there is no need
>>>>> for this.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>    2 files changed, 22 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>> work_struct *work)
>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>> amdgpu_device, virt);
>>>>>        int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>
>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>> -     * the VF FLR.
>>>>> -     */
>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>> -        return;
>>>>> -
>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>
>>>>>        xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>
>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>> work_struct *work)
>>>>>        } while (timeout > 1);
>>>>>
>>>>>    flr_done:
>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>> -    up_write(&adev->reset_sem);
>>>>> -
>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>            && (!amdgpu_device_has_job_running(adev) || diff --git
>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>> work_struct *work)
>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>> amdgpu_device, virt);
>>>>>        int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>
>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>> -     * the VF FLR.
>>>>> -     */
>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>> -        return;
>>>>> -
>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>
>>>>>        xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>
>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>> work_struct *work)
>>>>>        } while (timeout > 1);
>>>>>
>>>>>    flr_done:
>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>> -    up_write(&adev->reset_sem);
>>>>> -
>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>            && (!amdgpu_device_has_job_running(adev) ||



* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-03 10:17                 ` Christian König
@ 2022-01-04  9:07                   ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-04  9:07 UTC (permalink / raw)
  To: Christian König, Andrey Grodzovsky, Deng, Emily, Liu, Monk,
	Koenig, Christian, dri-devel, amd-gfx, Chen, Horace, Chen,
	 JingWen

Hi Christian,
I'm not sure what you mean by "we need to change SRIOV, not the driver".

Do you mean we should change the reset sequence in SRIOV? That would be a huge change for our SRIOV solution.

From my point of view, we can directly use amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
But we do need reset_sem and adev->in_gpu_reset to keep the device untouched by user space.

Best Regards,
Jingwen Chen

On 2022/1/3 6:17 PM, Christian König wrote:
> Please don't. This patch is vital to the cleanup of the reset procedure.
>
> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>
> Christian.
>
> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>> Sure, I guess i can drop this patch then.
>>
>> Andrey
>>
>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>
>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>>>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>>>>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>> Cc: daniel@ffwll.ch
>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>> for SRIOV
>>>>>
>>>>> [AMD Official Use Only]
>>>>>
>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>
>>>>> Please take a review on Andrey's patch
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>> <Horace.Chen@amd.com>
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>> for SRIOV
>>>>>
>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>> Since now flr work is serialized against  GPU resets there is no need
>>>>>> for this.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>    2 files changed, 22 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>>> amdgpu_device, virt);
>>>>>>        int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>
>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>> -     * the VF FLR.
>>>>>> -     */
>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>> -        return;
>>>>>> -
>>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>
>>>>>>        xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>
>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        } while (timeout > 1);
>>>>>>
>>>>>>    flr_done:
>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>> -    up_write(&adev->reset_sem);
>>>>>> -
>>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>>            && (!amdgpu_device_has_job_running(adev) || diff --git
>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>>> amdgpu_device, virt);
>>>>>>        int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>
>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>> -     * the VF FLR.
>>>>>> -     */
>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>> -        return;
>>>>>> -
>>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>
>>>>>>        xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>
>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        } while (timeout > 1);
>>>>>>
>>>>>>    flr_done:
>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>> -    up_write(&adev->reset_sem);
>>>>>> -
>>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>>            && (!amdgpu_device_has_job_running(adev) ||
>


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04  9:07                   ` JingWen Chen
@ 2022-01-04 10:18                     ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-04 10:18 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Andrey Grodzovsky, Deng,
	Emily, Liu, Monk, dri-devel, amd-gfx, Chen, Horace, Chen,
	JingWen

Hi Jingwen,

well, what I mean is that we need to adjust the implementation in amdgpu 
to actually match the requirements.

It could be that the reset sequence is questionable in general, but I 
doubt it, at least for now.

See, the FLR request from the hypervisor is just another source 
signaling the need for a reset, similar to a job timeout on each queue. 
Otherwise you have a race condition between the hypervisor and the 
scheduler.

Properly setting in_gpu_reset is indeed mandatory, but it should happen 
in a central place and not in the SRIOV-specific code.

In other words, I strongly think that the current SRIOV reset 
implementation is severely broken, and what Andrey is doing actually 
fixes it.

Regards,
Christian.

On 04.01.22 at 10:07, JingWen Chen wrote:
> Hi Christian,
> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>
> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>
>  From my point of view, we can directly use
> amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>
> Best Regards,
> Jingwen Chen
>
> On 2022/1/3 6:17 PM, Christian König wrote:
>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>
>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>
>> Christian.
>>
>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>> Sure, I guess i can drop this patch then.
>>>
>>> Andrey
>>>
>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>
>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>>>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>>>>>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>> Cc: daniel@ffwll.ch
>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>>> for SRIOV
>>>>>>
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>
>>>>>> Please take a review on Andrey's patch
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>> <Horace.Chen@amd.com>
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>>> for SRIOV
>>>>>>
>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>> Since now flr work is serialized against  GPU resets there is no need
>>>>>>> for this.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>     2 files changed, 22 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>
>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) || diff --git
>>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>
>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) ||


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-04 10:18                     ` Christian König
  0 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-04 10:18 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Andrey Grodzovsky, Deng,
	Emily, Liu, Monk, dri-devel, amd-gfx, Chen, Horace, Chen,
	JingWen
  Cc: daniel

Hi Jingwen,

well what I mean is that we need to adjust the implementation in amdgpu 
to actually match the requirements.

Could be that the reset sequence is questionable in general, but I doubt 
so at least for now.

See the FLR request from the hypervisor is just another source of 
signaling the need for a reset, similar to each job timeout on each 
queue. Otherwise you have a race condition between the hypervisor and 
the scheduler.

Properly setting in_gpu_reset is indeed mandatory, but should happen at 
a central place and not in the SRIOV specific code.

In other words I strongly think that the current SRIOV reset 
implementation is severely broken and what Andrey is doing is actually 
fixing it.

Regards,
Christian.

Am 04.01.22 um 10:07 schrieb JingWen Chen:
> Hi Christian,
> I'm not sure what you mean by "we need to change SRIOV not the driver".
>
> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>
>  From my point of view, we can directly use
> amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>
> Best Regards,
> Jingwen Chen
>
> On 2022/1/3 6:17 PM, Christian König wrote:
>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>
>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>
>> Christian.
>>
>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>> Sure, I guess I can drop this patch then.
>>>
>>> Andrey
>>>
>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to a very bad result.
>>>>
>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>>>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>>>>>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>> Cc: daniel@ffwll.ch
>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>>> for SRIOV
>>>>>>
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>
>>>>>> Please take a review on Andrey's patch
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>> <Horace.Chen@amd.com>
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>>>>>> for SRIOV
>>>>>>
>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>> Since now flr work is serialized against  GPU resets there is no need
>>>>>>> for this.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>     2 files changed, 22 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>
>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) || diff --git
>>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>
>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) ||



* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 10:18                     ` Christian König
@ 2022-01-04 10:49                       ` Liu, Monk
  -1 siblings, 0 replies; 103+ messages in thread
From: Liu, Monk @ 2022-01-04 10:49 UTC (permalink / raw)
  To: Koenig, Christian, Chen, JingWen, Christian König,
	Grodzovsky, Andrey, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, Chen, JingWen

[AMD Official Use Only]

>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
No, it's not. The FLR from the hypervisor only notifies the guest that the HW VF FLR is about to start or has already been executed; the host will do the FLR anyway without waiting too long for the guest.

>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
It makes the code crash ... how could that be a fix?

I'm afraid the patch is a NAK from me, but the cleanup is welcome as long as it does not ruin the logic; Andrey or Jingwen can try it if needed.

Thanks 
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Solution | AMD
-------------------------------------------------------------------
we are hiring software manager for CVS core team
-------------------------------------------------------------------

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com> 
Sent: Tuesday, January 4, 2022 6:19 PM
To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: daniel@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Hi Jingwen,

well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.

Could be that the reset sequence is questionable in general, but I doubt so at least for now.

See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.

Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.

In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.

Regards,
Christian.

Am 04.01.22 um 10:07 schrieb JingWen Chen:
> Hi Christian,
> I'm not sure what you mean by "we need to change SRIOV not the driver".
>
> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>
>  From my point of view, we can directly use amdgpu_device_lock_adev 
> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>
> Best Regards,
> Jingwen Chen
>
> On 2022/1/3 6:17 PM, Christian König wrote:
>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>
>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>
>> Christian.
>>
>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>> Sure, I guess I can drop this patch then.
>>>
>>> Andrey
>>>
>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to a very bad result.
>>>>
>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, 
>>>>>> Andrey <Andrey.Grodzovsky@amd.com>; 
>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org; 
>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen 
>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>> Cc: daniel@ffwll.ch
>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>>> protection for SRIOV
>>>>>>
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>
>>>>>> Please take a review on Andrey's patch
>>>>>>
>>>>>> Thanks
>>>>>> -----------------------------------------------------------------
>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -----------------------------------------------------------------
>>>>>> -- we are hiring software manager for CVS core team
>>>>>> -----------------------------------------------------------------
>>>>>> --
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri- 
>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace 
>>>>>> <Horace.Chen@amd.com>
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>>> protection for SRIOV
>>>>>>
>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>> Since now flr work is serialized against  GPU resets there is no 
>>>>>>> need for this.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>     drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>     2 files changed, 22 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>> 0, 0);
>>>>>>>
>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR 
>>>>>>> */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) || diff 
>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         struct amdgpu_device *adev = container_of(virt, struct
>>>>>> amdgpu_device, virt);
>>>>>>>         int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>
>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>> -     * the VF FLR.
>>>>>>> -     */
>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>> -        return;
>>>>>>> -
>>>>>>>         amdgpu_virt_fini_data_exchange(adev);
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>
>>>>>>>         xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>> 0, 0);
>>>>>>>
>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>> work_struct *work)
>>>>>>>         } while (timeout > 1);
>>>>>>>
>>>>>>>     flr_done:
>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>> -
>>>>>>>         /* Trigger recovery for world switch failure if no TDR 
>>>>>>> */
>>>>>>>         if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>             && (!amdgpu_device_has_job_running(adev) ||



* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 10:49                       ` Liu, Monk
@ 2022-01-04 11:36                         ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-04 11:36 UTC (permalink / raw)
  To: Liu, Monk, Chen, JingWen, Christian König, Grodzovsky,
	Andrey, Deng, Emily, dri-devel, amd-gfx, Chen, Horace

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No, it's not. The FLR from the hypervisor only notifies the guest that the HW VF FLR is about to start or has already been executed; the host will do the FLR anyway without waiting too long for the guest.
>

Then we have a major design issue in the SRIOV protocol and really need 
to question this.

How do you want to prevent a race between the hypervisor resetting the 
hardware and the client trying the same because of a timeout?

As far as I can see the procedure should be:
1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
2. For each of those potential reset sources a work item is sent to the single workqueue.
3. One of those work items executes first and prepares the reset.
4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
5. Clean up after the reset, eventually resubmitting jobs etc.
6. Cancel work items which might have been scheduled from other reset sources.

It does make sense that the hypervisor resets the hardware without waiting too long for the clients, but if we don't follow these general steps we will always have a race between the different components.

Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No, it's not. The FLR from the hypervisor only notifies the guest that the HW VF FLR is about to start or has already been executed; the host will do the FLR anyway without waiting too long for the guest.
>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
> It makes the code crash ... how could that be a fix?
>
> I'm afraid the patch is a NAK from me, but the cleanup is welcome as long as it does not ruin the logic; Andrey or Jingwen can try it if needed.
>
> Thanks
> -------------------------------------------------------------------
> Monk Liu | Cloud GPU & Virtualization Solution | AMD
> -------------------------------------------------------------------
> we are hiring software manager for CVS core team
> -------------------------------------------------------------------
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Tuesday, January 4, 2022 6:19 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>
> Hi Jingwen,
>
> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>
> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>
> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>
> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>
> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>> Hi Christian,
>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>
>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>
>>   From my point of view, we can directly use amdgpu_device_lock_adev
>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>
>> Best Regards,
>> Jingwen Chen
>>
>> On 2022/1/3 下午6:17, Christian König wrote:
>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>
>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>
>>> Christian.
>>>
>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>> Sure, I guess i can drop this patch then.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>
>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>> protection for SRIOV
>>>>>>>
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>
>>>>>>> Please take a review on Andrey's patch
>>>>>>>
>>>>>>> Thanks
>>>>>>> -----------------------------------------------------------------
>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> -----------------------------------------------------------------
>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>> -----------------------------------------------------------------
>>>>>>> --
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>> <Horace.Chen@amd.com>
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>> protection for SRIOV
>>>>>>>
>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>> need for this.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-04 11:36                         ` Christian König
  0 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-04 11:36 UTC (permalink / raw)
  To: Liu, Monk, Chen, JingWen, Christian König, Grodzovsky,
	Andrey, Deng, Emily, dri-devel, amd-gfx, Chen, Horace
  Cc: daniel

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>

Then we have a major design issue in the SRIOV protocol and really need 
to question this.

How do you want to prevent a race between the hypervisor resetting the 
hardware and the client trying the same because of a timeout?

As far as I can see, the procedure should be:
1. We detect that a reset is necessary, either because of a fault, a 
timeout, or a signal from the hypervisor.
2. For each of those potential reset sources a work item is sent to the 
single workqueue.
3. One of those work items executes first and prepares the reset.
4. We either do the reset ourselves or notify the hypervisor that we 
are ready for the reset.
5. Clean up after the reset, possibly resubmitting jobs etc.
6. Cancel work items which might have been scheduled from other reset 
sources.

It does make sense that the hypervisor resets the hardware without 
waiting for the clients for too long, but if we don't follow these 
general steps we will always have a race between the different components.

Regards,
Christian.




* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 11:36                         ` Christian König
@ 2022-01-04 16:56                           ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-04 16:56 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Chen, JingWen,
	Christian König, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022-01-04 6:36 a.m., Christian König wrote:
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each 
>>>> queue. Otherwise you have a race condition between the hypervisor 
>>>> and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
>
> Then we have a major design issue in the SRIOV protocol and really 
> need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the 
> hardware and the client trying the same because of a timeout?
>
> As far as I can see, the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a 
> timeout, or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to 
> the single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we 
> are ready for the reset.
> 5. Clean up after the reset, possibly resubmitting jobs etc.
> 6. Cancel work items which might have been scheduled from other reset 
> sources.
>
> It does make sense that the hypervisor resets the hardware without 
> waiting for the clients for too long, but if we don't follow these 
> general steps we will always have a race between the different 
> components.


Monk, just to add to this - if, as you say, 'FLR from 
hypervisor is just to notify guest the hw VF FLR is about to start or 
was already executed, but host will do FLR anyway without waiting for 
guest too long'
and there is no strict waiting by the hypervisor for 
IDH_READY_TO_RESET to be received from the guest before starting the 
reset, then setting in_gpu_reset and locking reset_sem on the guest 
side is not really foolproof protection from MMIO accesses by the 
guest - it only truly helps if the hypervisor waits for that message 
before initiating the HW reset.

Andrey


>
> Regards,
> Christian.


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-04 16:56                           ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-04 16:56 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Chen, JingWen,
	Christian König, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace
  Cc: daniel


On 2022-01-04 6:36 a.m., Christian König wrote:
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each 
>>>> queue. Otherwise you have a race condition between the hypervisor 
>>>> and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
>
> Then we have a major design issue in the SRIOV protocol and really 
> need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the 
> hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault a 
> timeout or signal from hypervisor.
> 2. For each of those potential reset sources a work item is send to 
> the single workqueue.
> 3. One of those work items execute first and prepares the reset.
> 4. We either do the reset our self or notify the hypervisor that we 
> are ready for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc..
> 6. Cancel work items which might have been scheduled from other reset 
> sources.
>
> It does make sense that the hypervisor resets the hardware without 
> waiting for the clients for too long, but if we don't follow this 
> general steps we will always have a race between the different 
> components.


Monk, just to add to this - if, as you say, 'FLR from hypervisor is 
just to notify guest the hw VF FLR is about to start or was already 
executed, but host will do FLR anyway without waiting for guest too 
long', and the hypervisor does not strictly wait to receive 
IDH_READY_TO_RESET from the guest before starting the reset, then 
setting in_gpu_reset and locking reset_sem on the guest side is not 
really foolproof protection against MMIO accesses by the guest - it 
only truly helps if the hypervisor waits for that message before 
initiating the HW reset.

Andrey


>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each 
>>>> queue. Otherwise you have a race condition between the hypervisor 
>>>> and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
>>>> In other words I strongly think that the current SRIOV reset 
>>>> implementation is severely broken and what Andrey is doing is 
>>>> actually fixing it.
>> It makes the code to crash ... how could it be a fix ?
>>
>> I'm afraid the patch is NAK from me,  but it is welcome if the 
>> cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
>> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
>> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, 
>> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
>> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
>> Chen, JingWen <JingWen.Chen2@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>> protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in 
>> amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I 
>> doubt so at least for now.
>>
>> See the FLR request from the hypervisor is just another source of 
>> signaling the need for a reset, similar to each job timeout on each 
>> queue. Otherwise you have a race condition between the hypervisor and 
>> the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen 
>> at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset 
>> implementation is severely broken and what Andrey is doing is 
>> actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what do you mean by "we need to change SRIOV not the 
>>> driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will 
>>> be a huge change for our SRIOV solution.
>>>
>>>   From my point of view, we can directly use amdgpu_device_lock_adev
>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since 
>>> no one will conflict with this thread with reset_domain introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep device 
>>> untouched via user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>> procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the 
>>>> driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess i can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs 
>>>>>> first, and do the flr, guest side thread may not know this and 
>>>>>> still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset 
>>>>>> and reset_sem to identify the reset status). And this may lead to 
>>>>>> very bad result.
>>>>>>
>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches 
>>>>>>> and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>> need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 11:36                         ` Christian König
@ 2022-01-04 17:13                           ` Liu, Shaoyun
  -1 siblings, 0 replies; 103+ messages in thread
From: Liu, Shaoyun @ 2022-01-04 17:13 UTC (permalink / raw)
  To: Koenig, Christian, Liu, Monk, Chen, JingWen,
	Christian König, Grodzovsky, Andrey, Deng, Emily, dri-devel,
	amd-gfx, Chen, Horace

[AMD Official Use Only]

I mostly agree with the sequence Christian described. Just one thing might need discussion here. For an FLR notified from the host, in the new sequence as described, the driver needs to reply with READY_TO_RESET from a work item on the reset work queue, which means that inside flr_work the driver cannot reply to the host directly but has to queue yet another work item. In the current code, the SRIOV flr_work is itself a work item queued from the ISR. I think we should respond to the host driver as soon as possible; queueing one work item from inside another doesn't sound efficient to me.
Anyway, what we need is a working solution for our project. So if we need to change the sequence, we need to make sure it has been tested first and won't break functionality before the code lands in the branch.

Regards
Shaoyun.liu


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Tuesday, January 4, 2022 6:36 AM
To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>
Cc: daniel@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>

Then we have a major design issue in the SRIOV protocol and really need to question this.

How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?

As far as I can see the procedure should be:
1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
2. For each of those potential reset sources a work item is send to the single workqueue.
3. One of those work items execute first and prepares the reset.
4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
5. Cleanup after the reset, eventually resubmit jobs etc..
6. Cancel work items which might have been scheduled from other reset sources.

It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.

Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
> It makes the code to crash ... how could it be a fix ?
>
> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>
> Thanks
> -------------------------------------------------------------------
> Monk Liu | Cloud GPU & Virtualization Solution | AMD
> -------------------------------------------------------------------
> we are hiring software manager for CVS core team
> -------------------------------------------------------------------
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Tuesday, January 4, 2022 6:19 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, 
> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
> Chen, JingWen <JingWen.Chen2@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> Hi Jingwen,
>
> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>
> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>
> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>
> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>
> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>> Hi Christian,
>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>
>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>
>>   From my point of view, we can directly use amdgpu_device_lock_adev 
>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>
>> Best Regards,
>> Jingwen Chen
>>
>> On 2022/1/3 下午6:17, Christian König wrote:
>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>
>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>
>>> Christian.
>>>
>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>> Sure, I guess i can drop this patch then.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>
>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, 
>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>; 
>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org; 
>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen 
>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>
>>>>>>> Please take a review on Andrey's patch
>>>>>>>
>>>>>>> Thanks
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> --
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri- 
>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace 
>>>>>>> <Horace.Chen@amd.com>
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>> Since now flr work is serialized against  GPU resets there is 
>>>>>>>> no need for this.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff 
>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-04 17:13                           ` Liu, Shaoyun
  0 siblings, 0 replies; 103+ messages in thread
From: Liu, Shaoyun @ 2022-01-04 17:13 UTC (permalink / raw)
  To: Koenig, Christian, Liu, Monk, Chen, JingWen,
	Christian König, Grodzovsky, Andrey, Deng, Emily, dri-devel,
	amd-gfx, Chen, Horace
  Cc: daniel

[AMD Official Use Only]

I mostly agree with the sequence Christian described. Just one thing might need discussion here. For an FLR notified from the host, in the new sequence as described, the driver needs to reply with READY_TO_RESET from a work item on the reset work queue, which means that inside flr_work the driver cannot reply to the host directly but has to queue yet another work item. In the current code, the SRIOV flr_work is itself a work item queued from the ISR. I think we should respond to the host driver as soon as possible; queueing one work item from inside another doesn't sound efficient to me.
Anyway, what we need is a working solution for our project. So if we need to change the sequence, we need to make sure it has been tested first and won't break functionality before the code lands in the branch.

Regards
Shaoyun.liu


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Tuesday, January 4, 2022 6:36 AM
To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>
Cc: daniel@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>

Then we have a major design issue in the SRIOV protocol and really need to question this.

How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?

As far as I can see the procedure should be:
1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
2. For each of those potential reset sources a work item is send to the single workqueue.
3. One of those work items execute first and prepares the reset.
4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
5. Cleanup after the reset, eventually resubmit jobs etc..
6. Cancel work items which might have been scheduled from other reset sources.

It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.

Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
> It makes the code to crash ... how could it be a fix ?
>
> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>
> Thanks
> -------------------------------------------------------------------
> Monk Liu | Cloud GPU & Virtualization Solution | AMD
> -------------------------------------------------------------------
> we are hiring software manager for CVS core team
> -------------------------------------------------------------------
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Tuesday, January 4, 2022 6:19 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, 
> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
> Chen, JingWen <JingWen.Chen2@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> Hi Jingwen,
>
> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>
> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>
> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>
> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>
> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>> Hi Christian,
>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>
>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>
>>   From my point of view, we can directly use amdgpu_device_lock_adev 
>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>
>> Best Regards,
>> Jingwen Chen
>>
>> On 2022/1/3 下午6:17, Christian König wrote:
>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>
>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>
>>> Christian.
>>>
>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>> Sure, I guess i can drop this patch then.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>
>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, 
>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>; 
>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org; 
>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen 
>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>
>>>>>>> Please take a review on Andrey's patch
>>>>>>>
>>>>>>> Thanks
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> --
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri- 
>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace 
>>>>>>> <Horace.Chen@amd.com>
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>> Since now flr work is serialized against  GPU resets there is 
>>>>>>>> no need for this.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff 
>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 17:13                           ` Liu, Shaoyun
@ 2022-01-04 20:54                             ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-04 20:54 UTC (permalink / raw)
  To: Liu, Shaoyun, Koenig, Christian, Liu, Monk, Chen, JingWen,
	Christian König, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:

> [AMD Official Use Only]
>
> I mostly agree with the sequence Christian described. Just one thing might need discussion here. For an FLR notified from the host, in the new sequence as described, the driver needs to reply READY_TO_RESET from a work item on the reset work queue, which means that inside flr_work the driver can not reply to the host directly but needs to queue yet another work item.


Can you clarify why 'driver can not directly reply to host but need to 
queue another workqueue'? To my understanding, all steps 3-6 in 
Christian's description happen serially from the same single wq thread.


>   For the current code, the flr_work for SRIOV itself is a work item queued from the ISR. I think we should try to respond to the host driver as soon as possible. Queuing another work item from inside the workqueue doesn't sound efficient to me.


Check patch 5 please [1] - I just substituted 
queue_work(adev->reset_domain.wq, &adev->virt.flr_work) for 
schedule_work(&adev->virt.flr_work), so there is no extra 
requeue here; instead of going to the system_wq the work item 
is sent to the dedicated reset wq.

[1] - 
https://lore.kernel.org/all/20211222221400.790842-1-andrey.grodzovsky@amd.com/
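
[Editor's note: the substitution described above amounts to roughly the
following kernel-side sketch. This is illustrative only - the field and
queue names follow the cover letter's reset_domain description, not
necessarily the final amdgpu code.]

```c
/* Sketch, not the actual amdgpu patch. One ordered workqueue per
 * reset domain means at most one reset work item runs at a time. */
struct amdgpu_reset_domain {
	struct workqueue_struct *wq;
};

/* Device init: the ordered wq that serializes all reset sources. */
adev->reset_domain.wq = alloc_ordered_workqueue("amdgpu-reset-dom", 0);

/* Host FLR interrupt handler: instead of
 *
 *	schedule_work(&adev->virt.flr_work);	// lands on system_wq
 *
 * queue the same work item on the per-domain ordered queue, so it is
 * serialized against TDR and every other reset trigger:
 */
queue_work(adev->reset_domain.wq, &adev->virt.flr_work);
```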

Andrey


> Anyway, what we need is a working  solution for our project.  So if we need to change the sequence, we  need to make sure it's been tested first and won't break the functionality before the code is landed in the branch .
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR
>> is about to start or was already executed, but host will do FLR anyway
>> without waiting for guest too long
>>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc.
> 6. Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR
>> is about to start or was already executed, but host will do FLR anyway
>> without waiting for guest too long
>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König
>> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu,
>> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org;
>> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>;
>> Chen, JingWen <JingWen.Chen2@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>> protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>
>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>
>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess i can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with shaoyun: if the host finds the gpu engine hang first and does the FLR, the guest side thread may not know this and may still try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to a very bad result.
>>>>>>
>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> -
>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> -
>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> -
>>>>>>>> --
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since now flr work is serialized against  GPU resets there is
>>>>>>>>> no need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void
>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE
>>>>>>>>> received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void
>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE
>>>>>>>>> received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 20:54                             ` Andrey Grodzovsky
@ 2022-01-05  0:01                               ` Liu, Shaoyun
  -1 siblings, 0 replies; 103+ messages in thread
From: Liu, Shaoyun @ 2022-01-05  0:01 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, Liu, Monk, Chen, JingWen,
	Christian König, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace

[AMD Official Use Only]

I see, I didn't notice you already have this implemented. So the flr_work routine itself is serialized now; in this case, I agree it should be safe to remove the in_gpu_reset and reset_sem in the flr_work.

Regards
Shaoyun.liu

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Tuesday, January 4, 2022 3:55 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Deng, Emily <Emily.Deng@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>
Cc: daniel@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:

> [AMD Official Use Only]
>
> I mostly agree with the sequence Christian described. Just one thing might need to be discussed here. For an FLR notified from the host, in the new sequence as described, the driver needs to reply READY_TO_RESET from a work item on the reset work queue, which means that inside flr_work the driver cannot reply to the host directly but needs to queue another work item.


Can you clarify why 'driver can not directly reply to host but need to queue another workqueue'? To my understanding, all of steps 3-6 in Christian's description happen serially from the same single wq thread.


>   For current code, the flr_work for SRIOV itself is a work item queued from the ISR. I think we should try to respond to the host driver as soon as possible; queuing another work item from inside the workqueue doesn't sound efficient to me.


Check patch 5 please [1] - I just replaced
schedule_work(&adev->virt.flr_work) with
queue_work(adev->reset_domain.wq, &adev->virt.flr_work), so there is no extra requeue here; the work item is simply sent to the dedicated reset wq instead of the system_wq.

[1] -
https://lore.kernel.org/all/20211222221400.790842-1-andrey.grodzovsky@amd.com/
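
In code, the change described above is roughly the following (a non-compilable sketch based on the description in this thread; the exact reset_domain field layout comes from the patch series and may differ in the final version):

```c
/* Before: FLR work went to the global system workqueue. */
schedule_work(&adev->virt.flr_work);

/* After (patch 5): the same work item goes to the driver's ordered,
 * single-threaded reset-domain workqueue, so the FLR handling is
 * serialized with TDR and every other reset work item. */
queue_work(adev->reset_domain.wq, &adev->virt.flr_work);
```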

Andrey


> Anyway, what we need is a working solution for our project. So if we need to change the sequence, we need to make sure it is tested first and won't break functionality before the code lands in the branch.
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of 
> Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen 
> <JingWen.Chen2@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; 
> dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, 
> Horace <Horace.Chen@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc..
> 6. Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
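
The race-free behavior of these steps follows from the workqueue being ordered; a deliberately simplified, single-threaded C model of it might look like this (names are illustrative, not the amdgpu API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the ordered reset workqueue: each reset source
 * enqueues one work item; the first item to run performs the reset
 * (steps 3-5) and cancels the items queued by the other sources
 * (step 6), so only one reset executes per burst of triggers. */

enum { MAX_WORK = 8 };

struct reset_work {
    const char *source;   /* "job_timeout", "flr", "ras", ... */
    bool cancelled;
};

struct reset_queue {
    struct reset_work items[MAX_WORK];
    size_t count;
};

void queue_reset(struct reset_queue *q, const char *source)
{
    if (q->count < MAX_WORK)
        q->items[q->count++] = (struct reset_work){ source, false };
}

/* Drain the queue in FIFO order; return how many resets actually ran. */
int drain(struct reset_queue *q)
{
    int performed = 0;

    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i].cancelled)
            continue;
        performed++;                          /* prepare, reset, cleanup */
        for (size_t j = i + 1; j < q->count; j++)
            q->items[j].cancelled = true;     /* step 6: cancel the rest */
    }
    q->count = 0;
    return performed;
}
```

With this model, three near-simultaneous triggers (FLR plus two job timeouts) still result in exactly one reset, which is the property the single workqueue is meant to guarantee.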
>
> Regards,
> Christian.
>
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
>> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
>> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, 
>> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
>> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
>> Chen, JingWen <JingWen.Chen2@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>> protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>
>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>> Hi Christian,
>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>
>>>    From my point of view, we can directly use 
>>> amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>> Sure, I guess i can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to very bad results.
>>>>>>
>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, 
>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>; 
>>>>>>>> dri-devel@lists.freedesktop.org; amd- 
>>>>>>>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
>>>>>>>> Chen, JingWen <JingWen.Chen2@amd.com>; Deng, Emily 
>>>>>>>> <Emily.Deng@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> -
>>>>>>>> -
>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> -
>>>>>>>> -
>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> -
>>>>>>>> -
>>>>>>>> --
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri- 
>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace 
>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>> Since now flr work is serialized against  GPU resets there is 
>>>>>>>>> no need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>> struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>> received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 
>>>>>>>>> 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void 
>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no 
>>>>>>>>> TDR */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff 
>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>> struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>> received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 
>>>>>>>>> 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void 
>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no 
>>>>>>>>> TDR */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 11:36                         ` Christian König
@ 2022-01-05  7:25                           ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-05  7:25 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Chen, JingWen,
	Christian König, Grodzovsky, Andrey, Deng, Emily, dri-devel,
	amd-gfx, Chen, Horace


On 2022/1/4 7:36 PM, Christian König wrote:
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>
>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the single workqueue.

I think Andrey has already used the same ordered work queue to handle resets from both the ring timeout and the hypervisor (patch 5).

So there should be no race between different reset sources. Since the ring timeout is much longer than the world-switch time slice (6 ms), we should see the reset from the hypervisor queued into the reset domain wq first, and only after the FLR work is done is the ring-timeout reset queued into the reset domain.
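
The timing argument above can be sketched in a few lines of C (a toy model; the numbers are illustrative, not driver constants):

```c
#include <assert.h>

/* On an ordered (FIFO, single-thread) reset workqueue, work items run
 * in enqueue order, and enqueue order follows detection time. The
 * world-switch slice (~6 ms) is far below the ring timeout, so the
 * hypervisor FLR work is enqueued, and therefore runs, before any
 * job-timeout (TDR) work for the same hang. */

enum { WORLD_SWITCH_MS = 6, RING_TIMEOUT_MS = 10000 };

/* Returns 1 if the FLR work item runs before the TDR work item. */
int flr_runs_before_tdr(void)
{
    int flr_enqueue_ms = WORLD_SWITCH_MS;    /* host signals FLR early  */
    int tdr_enqueue_ms = RING_TIMEOUT_MS;    /* scheduler notices later */
    return flr_enqueue_ms < tdr_enqueue_ms;  /* FIFO: earlier runs first */
}
```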

> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc..
> 6. Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.

So reset_sem and in_gpu_reset are there to prevent races between the reset_domain (mostly the hypervisor source) and other accessors such as user space (e.g. KFD).
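
The "keep the device untouched from user space" pattern can be sketched as follows (a toy single-threaded model; the names are illustrative, not the real adev fields or kernel locking primitives):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the in_gpu_reset / reset_sem pattern: the reset path
 * publishes "reset in progress", and user-space entry points (e.g.
 * KFD ioctls) check it and back off instead of touching hardware
 * that is mid-reset. */

struct dev_state {
    bool in_gpu_reset;   /* models atomic adev->in_gpu_reset     */
    bool write_locked;   /* models adev->reset_sem held for write */
};

void reset_begin(struct dev_state *d)
{
    d->write_locked = true;   /* take the "reset" write lock */
    d->in_gpu_reset = true;   /* publish reset-in-progress   */
}

void reset_end(struct dev_state *d)
{
    d->in_gpu_reset = false;
    d->write_locked = false;
}

/* Returns true if the HW access may proceed, false if it backed off. */
bool user_hw_access(struct dev_state *d)
{
    if (d->write_locked || d->in_gpu_reset)
        return false;   /* device is being reset: don't touch HW */
    return true;        /* safe to access registers */
}
```

In the real driver the check and the reset path are serialized by the semaphore itself; this sketch only shows the back-off behavior user space relies on.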

>
> Regards,
> Christian.
>
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>
>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>
>>>   From my point of view, we can directly use amdgpu_device_lock_adev
>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess i can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>>
>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>> need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-05  7:25                           ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-05  7:25 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Chen, JingWen,
	Christian König, Grodzovsky, Andrey, Deng, Emily, dri-devel,
	amd-gfx, Chen, Horace
  Cc: daniel


On 2022/1/4 下午7:36, Christian König wrote:
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>
>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
> 2. For each of those potential reset sources a work item is send to the single workqueue.

I think Andrey has already used the same ordered work queue to handle resets from both the ring timeout and the hypervisor (patch 5).

So there should be no race between the different reset sources. Since a ring timeout is much longer than the world-switch time slice (6 ms), the reset from the hypervisor should be queued into the reset-domain workqueue first, and only after the FLR work is done will the ring-timeout reset be queued into the reset domain.

> 3. One of those work items execute first and prepares the reset.
> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc..
> 6. Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.

So reset_sem and in_gpu_reset are there to prevent races between the reset_domain (mostly the hypervisor source) and other user-space consumers (e.g. KFD).

>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>> It makes the code to crash ... how could it be a fix ?
>>
>> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>
>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>
>>>   From my point of view, we can directly use amdgpu_device_lock_adev
>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess i can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>>
>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>> protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>> need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>> amdgpu_device, virt);
>>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>> 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>      flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>> */
>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-04 16:56                           ` Andrey Grodzovsky
@ 2022-01-05  7:34                             ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-05  7:34 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Liu, Monk, Chen,
	JingWen, Christian König, Deng, Emily, dri-devel, amd-gfx,
	Chen, Horace


On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>
> On 2022-01-04 6:36 a.m., Christian König wrote:
>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>> [AMD Official Use Only]
>>>
>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>
>>
>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>
>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>
>> As far as I can see the procedure should be:
>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>> 3. One of those work items execute first and prepares the reset.
>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>> 6. Cancel work items which might have been scheduled from other reset sources.
>>
>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>
>
> Monk, just to add to this: if, as you say, the FLR from the hypervisor is only a notification that the hw VF FLR is about to start or was already executed, and the host will do the FLR anyway without waiting for the guest too long,
> then there is no strict wait by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset. In that case, setting in_gpu_reset and locking reset_sem on the guest side is not really foolproof
> protection against MMIO accesses by the guest; it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>
Hi Andrey, this cannot be done. If the guest kernel somehow hangs and never has a chance to send the response back, the other VFs would have to wait for its reset, and all the VFs would hang in that case. Sometimes the mailbox also has some delay, and the other VFs would wait as well; the users of the other VFs would be affected either way.
> Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>> [AMD Official Use Only]
>>>
>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>
>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>> It makes the code to crash ... how could it be a fix ?
>>>
>>> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>
>>> Thanks
>>> -------------------------------------------------------------------
>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>> -------------------------------------------------------------------
>>> we are hiring software manager for CVS core team
>>> -------------------------------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>> Cc: daniel@ffwll.ch
>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>
>>> Hi Jingwen,
>>>
>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>
>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>
>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>> Hi Christian,
>>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>>
>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>
>>>>   From my point of view, we can directly use amdgpu_device_lock_adev
>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>>
>>>> Best Regards,
>>>> Jingwen Chen
>>>>
>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>
>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>> Sure, I guess i can drop this patch then.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>>>
>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>> protection for SRIOV
>>>>>>>>>
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>
>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- 
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>> protection for SRIOV
>>>>>>>>>
>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>> need for this.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>
>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>> -     */
>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>> -        return;
>>>>>>>>>> -
>>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>
>>>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>> 0, 0);
>>>>>>>>>>
>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>>
>>>>>>>>>>      flr_done:
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>> -
>>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>> */
>>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>
>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>> -     */
>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>> -        return;
>>>>>>>>>> -
>>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>
>>>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>> 0, 0);
>>>>>>>>>>
>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>>
>>>>>>>>>>      flr_done:
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>> -
>>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>> */
>>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||
>>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-05  7:34                             ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-05  7:34 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Liu, Monk, Chen,
	JingWen, Christian König, Deng, Emily, dri-devel, amd-gfx,
	Chen, Horace
  Cc: daniel


On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>
> On 2022-01-04 6:36 a.m., Christian König wrote:
>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>> [AMD Official Use Only]
>>>
>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>
>>
>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>
>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>
>> As far as I can see the procedure should be:
>> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>> 3. One of those work items executes first and prepares the reset.
>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>> 5. Cleanup after the reset, eventually resubmit jobs etc.
>> 6. Cancel work items which might have been scheduled from other reset sources.
>>
>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>
>
> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>
Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never gets the chance to send the response back, the other VFs will have to wait for its reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait; the users of the other VFs will be affected in this case.
> Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>> [AMD Official Use Only]
>>>
>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>
>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>> It makes the code crash ... how could it be a fix?
>>>
>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>
>>> Thanks
>>> -------------------------------------------------------------------
>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>> -------------------------------------------------------------------
>>> we are hiring software manager for CVS core team
>>> -------------------------------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>> Cc: daniel@ffwll.ch
>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>
>>> Hi Jingwen,
>>>
>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>
>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>
>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>> Hi Christian,
>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>
>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>
>>>> From my point of view, we can directly use amdgpu_device_lock_adev
>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>
>>>> Best Regards,
>>>> Jingwen Chen
>>>>
>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>
>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>> Sure, I guess I can drop this patch then.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to very bad results.
>>>>>>>
>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>> protection for SRIOV
>>>>>>>>>
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>
>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- 
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>> protection for SRIOV
>>>>>>>>>
>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>> need for this.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>
>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>> -     */
>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>> -        return;
>>>>>>>>>> -
>>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>
>>>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>> 0, 0);
>>>>>>>>>>
>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>>
>>>>>>>>>>      flr_done:
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>> -
>>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>> */
>>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>
>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>> -     */
>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>> -        return;
>>>>>>>>>> -
>>>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>
>>>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>> 0, 0);
>>>>>>>>>>
>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>>          } while (timeout > 1);
>>>>>>>>>>
>>>>>>>>>>      flr_done:
>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>> -
>>>>>>>>>>          /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>> */
>>>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||
>>


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-05  7:34                             ` JingWen Chen
@ 2022-01-05  7:59                               ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-05  7:59 UTC (permalink / raw)
  To: JingWen Chen, Andrey Grodzovsky, Christian König, Liu, Monk,
	Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen, Horace

Am 05.01.22 um 08:34 schrieb JingWen Chen:
> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>> [AMD Official Use Only]
>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>
>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>
>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>
>>> As far as I can see the procedure should be:
>>> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>> 3. One of those work items executes first and prepares the reset.
>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>> 5. Cleanup after the reset, eventually resubmit jobs etc.
>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>
>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>
>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>
> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never gets the chance to send the response back, the other VFs will have to wait for its reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait; the users of the other VFs will be affected in this case.

Yeah, agree completely with JingWen. The hypervisor is the one in charge 
here, not the guest.

What the hypervisor should do (and it already seems to be designed that
way) is to send the guest a message that a reset is about to happen and
give it some time to respond appropriately.

The guest on the other hand then tells the hypervisor that all
processing has stopped and it is ready to restart. If that doesn't
happen in time the hypervisor should eliminate the guest and probably
trigger even more severe consequences, e.g. restart the whole VM etc...

Christian.

>> Andrey
>>
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>> [AMD Official Use Only]
>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>
>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>> It makes the code crash ... how could it be a fix?
>>>>
>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>
>>>> Thanks
>>>> -------------------------------------------------------------------
>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>> -------------------------------------------------------------------
>>>> we are hiring software manager for CVS core team
>>>> -------------------------------------------------------------------
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>> Cc: daniel@ffwll.ch
>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>
>>>> Hi Jingwen,
>>>>
>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>
>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>
>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>> Hi Christian,
>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>
>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>
>>>>> From my point of view, we can directly use amdgpu_device_lock_adev
>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>
>>>>> Best Regards,
>>>>> Jingwen Chen
>>>>>
>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>
>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to very bad results.
>>>>>>>>
>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>> protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>
>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- 
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>> protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>> need for this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>
>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>> -        return;
>>>>>>>>>>> -
>>>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>
>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>> 0, 0);
>>>>>>>>>>>
>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>
>>>>>>>>>>>       flr_done:
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>> -
>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>> */
>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>
>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>> -        return;
>>>>>>>>>>> -
>>>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>
>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>> 0, 0);
>>>>>>>>>>>
>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>
>>>>>>>>>>>       flr_done:
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>> -
>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>> */
>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||



* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-05  7:59                               ` Christian König
  0 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-05  7:59 UTC (permalink / raw)
  To: JingWen Chen, Andrey Grodzovsky, Christian König, Liu, Monk,
	Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen, Horace
  Cc: daniel

Am 05.01.22 um 08:34 schrieb JingWen Chen:
> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>> [AMD Official Use Only]
>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>
>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>
>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>
>>> As far as I can see the procedure should be:
>>> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>> 3. One of those work items executes first and prepares the reset.
>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>> 5. Cleanup after the reset, eventually resubmit jobs etc.
>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>
>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>
>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>
> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never gets the chance to send the response back, the other VFs will have to wait for its reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait; the users of the other VFs will be affected in this case.

Yeah, agree completely with JingWen. The hypervisor is the one in charge 
here, not the guest.

What the hypervisor should do (and it already seems to be designed that
way) is to send the guest a message that a reset is about to happen and
give it some time to respond appropriately.

The guest on the other hand then tells the hypervisor that all
processing has stopped and it is ready to restart. If that doesn't
happen in time the hypervisor should eliminate the guest and probably
trigger even more severe consequences, e.g. restart the whole VM etc...

Christian.

>> Andrey
>>
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>> [AMD Official Use Only]
>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>> No it's not; FLR from the hypervisor is just to notify the guest that the hw VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>
>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>> It makes the code crash ... how could it be a fix?
>>>>
>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>
>>>> Thanks
>>>> -------------------------------------------------------------------
>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>> -------------------------------------------------------------------
>>>> we are hiring software manager for CVS core team
>>>> -------------------------------------------------------------------
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>> Cc: daniel@ffwll.ch
>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>
>>>> Hi Jingwen,
>>>>
>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>
>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>
>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>
>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>
>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>> Hi Christian,
>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>
>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>
>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>
>>>>> Best Regards,
>>>>> Jingwen Chen
>>>>>
>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>
>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to a very bad result.
>>>>>>>>
>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>> protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>
>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>> -- 
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>> protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>> Since the FLR work is now serialized against GPU resets, there
>>>>>>>>>>> is no need for this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>
>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>> -        return;
>>>>>>>>>>> -
>>>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>
>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>> 0, 0);
>>>>>>>>>>>
>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>
>>>>>>>>>>>       flr_done:
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>> -
>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>> */
>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>
>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>> -        return;
>>>>>>>>>>> -
>>>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>
>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>> 0, 0);
>>>>>>>>>>>
>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>
>>>>>>>>>>>       flr_done:
>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>> -
>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>> */
>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2021-12-22 22:05   ` Andrey Grodzovsky
@ 2022-01-05  9:54     ` Lazar, Lijo
  -1 siblings, 0 replies; 103+ messages in thread
From: Lazar, Lijo @ 2022-01-05  9:54 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: horace.chen, christian.koenig, Monk.Liu



On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
> Use reset domain wq also for non TDR gpu recovery triggers
> such as sysfs and RAS. We must serialize all possible
> GPU recoveries to guarantee no concurrency there.
> For TDR call the original recovery function directly since
> it's already executed from within the wq. For others just
> use a wrapper to queue work and wait on it to finish.
> 
> v2: Rename to amdgpu_recover_work_struct
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>   3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index b5ff76aae7e0..8e96b9a14452 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			      struct amdgpu_job* job);
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> +			      struct amdgpu_job *job);
>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7c063fd37389..258ec3c0b2af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>    * Returns 0 for success or an error on failure.
>    */
>   
> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   			      struct amdgpu_job *job)
>   {
>   	struct list_head device_list, *device_list_handle =  NULL;
> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +struct amdgpu_recover_work_struct {
> +	struct work_struct base;
> +	struct amdgpu_device *adev;
> +	struct amdgpu_job *job;
> +	int ret;
> +};
> +
> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> +{
> +	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> +
> +	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +}
> +/*
> + * Serialize gpu recover into reset domain single threaded wq
> + */
> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> +				    struct amdgpu_job *job)
> +{
> +	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
> +
> +	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> +
> +	if (!queue_work(adev->reset_domain.wq, &work.base))
> +		return -EAGAIN;
> +

The decision to schedule a reset is made at this point. Subsequent 
accesses to hardware may not be reliable. So should the in_reset flag 
be set right here rather than waiting for the work to start executing?

Also, what about having the reset_active or in_reset flag in the 
reset_domain itself?

Thanks,
Lijo

> +	flush_work(&work.base);
> +
> +	return work.ret;
> +}
> +
>   /**
>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>    *
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index bfc47bea23db..38c9fd7b7ad4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> -		amdgpu_device_gpu_recover(ring->adev, job);
> +		amdgpu_device_gpu_recover_imp(ring->adev, job);
>   	} else {
>   		drm_sched_suspend_timeout(&ring->sched);
>   		if (amdgpu_sriov_vf(adev))
> 


* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05  9:54     ` Lazar, Lijo
@ 2022-01-05 12:31       ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-05 12:31 UTC (permalink / raw)
  To: Lazar, Lijo, Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

On 05.01.22 at 10:54, Lazar, Lijo wrote:
> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>> Use reset domain wq also for non TDR gpu recovery triggers
>> such as sysfs and RAS. We must serialize all possible
>> GPU recoveries to guarantee no concurrency there.
>> For TDR call the original recovery function directly since
>> it's already executed from within the wq. For others just
>> use a wrapper to queue work and wait on it to finish.
>>
>> v2: Rename to amdgpu_recover_work_struct
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index b5ff76aae7e0..8e96b9a14452 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>> amdgpu_device *adev);
>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>                     struct amdgpu_job* job);
>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>> +                  struct amdgpu_job *job);
>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 7c063fd37389..258ec3c0b2af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>    * Returns 0 for success or an error on failure.
>>    */
>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>                     struct amdgpu_job *job)
>>   {
>>       struct list_head device_list, *device_list_handle =  NULL;
>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>       return r;
>>   }
>>   +struct amdgpu_recover_work_struct {
>> +    struct work_struct base;
>> +    struct amdgpu_device *adev;
>> +    struct amdgpu_job *job;
>> +    int ret;
>> +};
>> +
>> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct 
>> *work)
>> +{
>> +    struct amdgpu_recover_work_struct *recover_work = 
>> container_of(work, struct amdgpu_recover_work_struct, base);
>> +
>> +    recover_work->ret = 
>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>> +}
>> +/*
>> + * Serialize gpu recover into reset domain single threaded wq
>> + */
>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>> +                    struct amdgpu_job *job)
>> +{
>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>> job};
>> +
>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>> +
>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>> +        return -EAGAIN;
>> +
>
> The decision to schedule a reset is made at this point. Subsequent 
> accesses to hardware may not be reliable. So should the flag in_reset 
> be set here itself rather than waiting for the work to start execution?

No, when we race and lose, the VM is completely lost and probably 
restarted by the hypervisor.

And when we race and win, we properly set the flag before signaling the 
hypervisor that it can continue with the reset.

> Also, what about having the reset_active or in_reset flag in the 
> reset_domain itself?

Offhand that sounds like a good idea.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>> +    flush_work(&work.base);
>> +
>> +    return work.ret;
>> +}
>> +
>>   /**
>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>    *
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index bfc47bea23db..38c9fd7b7ad4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>> -        amdgpu_device_gpu_recover(ring->adev, job);
>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>       } else {
>>           drm_sched_suspend_timeout(&ring->sched);
>>           if (amdgpu_sriov_vf(adev))
>>



* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 12:31       ` Christian König
@ 2022-01-05 13:11         ` Lazar, Lijo
  -1 siblings, 0 replies; 103+ messages in thread
From: Lazar, Lijo @ 2022-01-05 13:11 UTC (permalink / raw)
  To: Christian König, Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu



On 1/5/2022 6:01 PM, Christian König wrote:
> On 05.01.22 at 10:54, Lazar, Lijo wrote:
>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>> Use reset domain wq also for non TDR gpu recovery triggers
>>> such as sysfs and RAS. We must serialize all possible
>>> GPU recoveries to guarantee no concurrency there.
>>> For TDR call the original recovery function directly since
>>> it's already executed from within the wq. For others just
>>> use a wrapper to queue work and wait on it to finish.
>>>
>>> v2: Rename to amdgpu_recover_work_struct
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index b5ff76aae7e0..8e96b9a14452 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>> amdgpu_device *adev);
>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                     struct amdgpu_job* job);
>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>> +                  struct amdgpu_job *job);
>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 7c063fd37389..258ec3c0b2af 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>    * Returns 0 for success or an error on failure.
>>>    */
>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>                     struct amdgpu_job *job)
>>>   {
>>>       struct list_head device_list, *device_list_handle =  NULL;
>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>       return r;
>>>   }
>>>   +struct amdgpu_recover_work_struct {
>>> +    struct work_struct base;
>>> +    struct amdgpu_device *adev;
>>> +    struct amdgpu_job *job;
>>> +    int ret;
>>> +};
>>> +
>>> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct 
>>> *work)
>>> +{
>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>> +
>>> +    recover_work->ret = 
>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>> +}
>>> +/*
>>> + * Serialize gpu recover into reset domain single threaded wq
>>> + */
>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> +                    struct amdgpu_job *job)
>>> +{
>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>> job};
>>> +
>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>> +
>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>> +        return -EAGAIN;
>>> +
>>
>> The decision to schedule a reset is made at this point. Subsequent 
>> accesses to hardware may not be reliable. So should the flag in_reset 
>> be set here itself rather than waiting for the work to start execution?
> 
> No, when we race and lose the VM is completely lost and probably 
> restarted by the hypervisor.
> 
> And when we race and win we properly set the flag before signaling the 
> hypervisor that it can continue with the reset.
> 

I was talking about the bare-metal case. When this was synchronous, the 
in_reset flag was set as one of the first things, and amdgpu_in_reset was 
checked to prevent further hardware accesses. This design only changes 
the recovery part and doesn't change the hardware perspective. Potential 
accesses from other processes need to be blocked as soon as we determine 
a reset is required. Are we expecting the work to be executed immediately 
and to set the flags?

Thanks,
Lijo

>> Also, what about having the reset_active or in_reset flag in the 
>> reset_domain itself?
> 
> Offhand, that sounds like a good idea.
> 
> Regards,
> Christian.
> 
>>
>> Thanks,
>> Lijo
>>
>>> +    flush_work(&work.base);
>>> +
>>> +    return work.ret;
>>> +}
>>> +
>>>   /**
>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>    *
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>       } else {
>>>           drm_sched_suspend_timeout(&ring->sched);
>>>           if (amdgpu_sriov_vf(adev))
>>>
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 13:11         ` Lazar, Lijo
@ 2022-01-05 13:15           ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-05 13:15 UTC (permalink / raw)
  To: Lazar, Lijo, Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
> On 1/5/2022 6:01 PM, Christian König wrote:
>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>> Use the reset domain wq also for non-TDR GPU recovery triggers
>>>> such as sysfs and RAS. We must serialize all possible
>>>> GPU recoveries to guarantee no concurrency there.
>>>> For TDR, call the original recovery function directly since
>>>> it's already executed from within the wq. For others, just
>>>> use a wrapper to queue work and wait on it to finish.
>>>>
>>>> v2: Rename to amdgpu_recover_work_struct
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>>> +++++++++++++++++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>>> amdgpu_device *adev);
>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>                     struct amdgpu_job* job);
>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>> +                  struct amdgpu_job *job);
>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>    * Returns 0 for success or an error on failure.
>>>>    */
>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>                     struct amdgpu_job *job)
>>>>   {
>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>       return r;
>>>>   }
>>>>   +struct amdgpu_recover_work_struct {
>>>> +    struct work_struct base;
>>>> +    struct amdgpu_device *adev;
>>>> +    struct amdgpu_job *job;
>>>> +    int ret;
>>>> +};
>>>> +
>>>> +static void amdgpu_device_queue_gpu_recover_work(struct 
>>>> work_struct *work)
>>>> +{
>>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>> +
>>>> +    recover_work->ret = 
>>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>> +}
>>>> +/*
>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>> + */
>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>> +                    struct amdgpu_job *job)
>>>> +{
>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>>> job};
>>>> +
>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>> +
>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>> +        return -EAGAIN;
>>>> +
>>>
>>> The decision to schedule a reset is made at this point. Subsequent 
>>> accesses to hardware may not be reliable. So should the flag 
>>> in_reset be set here itself rather than waiting for the work to 
>>> start execution?
>>
>> No, when we race and lose the VM is completely lost and probably 
>> restarted by the hypervisor.
>>
>> And when we race and win we properly set the flag before signaling 
>> the hypervisor that it can continue with the reset.
>>
>
> I was talking about baremetal case. When this was synchronous, 
> in_reset flag is set as one of the first things and amdgpu_in_reset is 
> checked to prevent further hardware accesses. This design only changes 
> the recover part and doesn't change the hardware perspective. 

> Potential accesses from other processes need to be blocked as soon as 
> we determine a reset is required.

That's an incorrect assumption.

Accessing the hardware is perfectly ok as long as the reset hasn't 
started yet. In other words even when the hardware is locked up you can 
still happily read/write registers or access the VRAM BAR.

It is only while the hardware is actually performing a reset that we 
can't touch it, or there might be unfortunate consequences (usually a 
complete system lockup).

Regards,
Christian.

> Are we expecting the work to be immediately executed and set the flags?
>
> Thanks,
> Lijo
>
>>> Also, what about having the reset_active or in_reset flag in the 
>>> reset_domain itself?
>>
>> Offhand, that sounds like a good idea.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> +    flush_work(&work.base);
>>>> +
>>>> +    return work.ret;
>>>> +}
>>>> +
>>>>   /**
>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>    *
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>       } else {
>>>>           drm_sched_suspend_timeout(&ring->sched);
>>>>           if (amdgpu_sriov_vf(adev))
>>>>
>>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 13:15           ` Christian König
@ 2022-01-05 13:26             ` Lazar, Lijo
  -1 siblings, 0 replies; 103+ messages in thread
From: Lazar, Lijo @ 2022-01-05 13:26 UTC (permalink / raw)
  To: Christian König, Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu



On 1/5/2022 6:45 PM, Christian König wrote:
> Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
>> On 1/5/2022 6:01 PM, Christian König wrote:
>>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>> Use the reset domain wq also for non-TDR GPU recovery triggers
>>>>> such as sysfs and RAS. We must serialize all possible
>>>>> GPU recoveries to guarantee no concurrency there.
>>>>> For TDR, call the original recovery function directly since
>>>>> it's already executed from within the wq. For others, just
>>>>> use a wrapper to queue work and wait on it to finish.
>>>>>
>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>>>> +++++++++++++++++++++-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>>>> amdgpu_device *adev);
>>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job* job);
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> +                  struct amdgpu_job *job);
>>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>    * Returns 0 for success or an error on failure.
>>>>>    */
>>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job *job)
>>>>>   {
>>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>       return r;
>>>>>   }
>>>>>   +struct amdgpu_recover_work_struct {
>>>>> +    struct work_struct base;
>>>>> +    struct amdgpu_device *adev;
>>>>> +    struct amdgpu_job *job;
>>>>> +    int ret;
>>>>> +};
>>>>> +
>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct 
>>>>> work_struct *work)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>>> +
>>>>> +    recover_work->ret = 
>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>>> +}
>>>>> +/*
>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>> + */
>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +                    struct amdgpu_job *job)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>>>> job};
>>>>> +
>>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>> +
>>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>> +        return -EAGAIN;
>>>>> +
>>>>
>>>> The decision to schedule a reset is made at this point. Subsequent 
>>>> accesses to hardware may not be reliable. So should the flag 
>>>> in_reset be set here itself rather than waiting for the work to 
>>>> start execution?
>>>
>>> No, when we race and lose the VM is completely lost and probably 
>>> restarted by the hypervisor.
>>>
>>> And when we race and win we properly set the flag before signaling 
>>> the hypervisor that it can continue with the reset.
>>>
>>
>> I was talking about baremetal case. When this was synchronous, 
>> in_reset flag is set as one of the first things and amdgpu_in_reset is 
>> checked to prevent further hardware accesses. This design only changes 
>> the recover part and doesn't change the hardware perspective. 
> 
>> Potential accesses from other processes need to be blocked as soon as 
>> we determine a reset is required.
> 
> That's an incorrect assumption.
> 
> Accessing the hardware is perfectly ok as long as the reset hasn't 
> started yet. In other words even when the hardware is locked up you can 
> still happily read/write registers or access the VRAM BAR.
> 

Not sure if that is 100% correct, e.g. for a recovery triggered by a RAS 
error (depends on the access done).

Thanks,
Lijo

> It is only while the hardware is actually performing a reset that we 
> can't touch it, or there might be unfortunate consequences (usually a 
> complete system lockup).
> 
> Regards,
> Christian.
> 
>> Are we expecting the work to be immediately executed and set the flags?
>>
>> Thanks,
>> Lijo
>>
>>>> Also, what about having the reset_active or in_reset flag in the 
>>>> reset_domain itself?
>>>
>>> Offhand, that sounds like a good idea.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> +    flush_work(&work.base);
>>>>> +
>>>>> +    return work.ret;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>>    *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>>       } else {
>>>>>           drm_sched_suspend_timeout(&ring->sched);
>>>>>           if (amdgpu_sriov_vf(adev))
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
@ 2022-01-05 13:26             ` Lazar, Lijo
  0 siblings, 0 replies; 103+ messages in thread
From: Lazar, Lijo @ 2022-01-05 13:26 UTC (permalink / raw)
  To: Christian König, Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: horace.chen, daniel, Monk.Liu



On 1/5/2022 6:45 PM, Christian König wrote:
> Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
>> On 1/5/2022 6:01 PM, Christian König wrote:
>>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>> Use reset domain wq also for non TDR gpu recovery trigers
>>>>> such as sysfs and RAS. We must serialize all possible
>>>>> GPU recoveries to gurantee no concurrency there.
>>>>> For TDR call the original recovery function directly since
>>>>> it's already executed from within the wq. For others just
>>>>> use a wrapper to qeueue work and wait on it to finish.
>>>>>
>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>>>> +++++++++++++++++++++-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>>>> amdgpu_device *adev);
>>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job* job);
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> +                  struct amdgpu_job *job);
>>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>    * Returns 0 for success or an error on failure.
>>>>>    */
>>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job *job)
>>>>>   {
>>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>       return r;
>>>>>   }
>>>>>   +struct amdgpu_recover_work_struct {
>>>>> +    struct work_struct base;
>>>>> +    struct amdgpu_device *adev;
>>>>> +    struct amdgpu_job *job;
>>>>> +    int ret;
>>>>> +};
>>>>> +
>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct 
>>>>> work_struct *work)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>>> +
>>>>> +    recover_work->ret = 
>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>>> +}
>>>>> +/*
>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>> + */
>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +                    struct amdgpu_job *job)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>>>> job};
>>>>> +
>>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>> +
>>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>> +        return -EAGAIN;
>>>>> +
>>>>
>>>> The decision to schedule a reset is made at this point. Subsequent 
>>>> accesses to hardware may not be reliable. So should the flag 
>>>> in_reset be set here itself rather than waiting for the work to 
>>>> start execution?
>>>
>>> No, when we race and lose the VM is completely lost and probably 
>>> restarted by the hypervisor.
>>>
>>> And when we race and win we properly set the flag before signaling 
>>> the hypervisor that it can continue with the reset.
>>>
>>
>> I was talking about baremetal case. When this was synchronous, 
>> in_reset flag is set as one of the first things and amdgpu_in_reset is 
>> checked to prevent further hardware accesses. This design only changes 
>> the recover part and doesn't change the hardware perspective. 
> 
>> Potential accesses from other processes need to be blocked as soon as 
>> we determine a reset is required.
> 
> That's an incorrect assumption.
> 
> Accessing the hardware is perfectly ok as long as the reset hasn't 
> started yet. In other words even when the hardware is locked up you can 
> still happily read/write registers or access the VRAM BAR.
> 

Not sure if that is 100% correct like a recovery triggered by RAS error 
(depends on the access done).

Thanks,
Lijo

> Only when the hardware is currently performing a reset, then we can't 
> touch it or there might be unfortunate consequences (usually complete 
> system lockup).
> 
> Regards,
> Christian.
> 
>> Are we expecting the work to be immediately executed and set the flags?
>>
>> Thanks,
>> Lijo
>>
>>>> Also, what about having the reset_active or in_reset flag in the 
>>>> reset_domain itself?
>>>
>>> Off hand that sounds like a good idea.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> +    flush_work(&work.base);
>>>>> +
>>>>> +    return work.ret;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>>    *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>>       } else {
>>>>>           drm_sched_suspend_timeout(&ring->sched);
>>>>>           if (amdgpu_sriov_vf(adev))
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread
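To make the queue_work()/flush_work() flow in the hunk above concrete, here is a minimal userspace sketch (an editorial illustration, not driver code; all names are hypothetical). Each caller puts its own work item on a FIFO drained by a single worker thread, the analogue of the reset domain's ordered workqueue, and blocks until its item completes, the analogue of flush_work(). Because there is only one worker, two recoveries can never run concurrently.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct recover_work {
    struct recover_work *next;
    int done;
    int ret;
};

static pthread_mutex_t q_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv    = PTHREAD_COND_INITIALIZER;  /* worker wakeup */
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;  /* waiter wakeup */
static struct recover_work *q_head, *q_tail;

static int in_flight, max_in_flight;    /* sanity counters for the sketch */

static int do_recover(void)   /* stands in for amdgpu_device_gpu_recover_imp() */
{
    pthread_mutex_lock(&q_lock);
    if (++in_flight > max_in_flight)
        max_in_flight = in_flight;
    pthread_mutex_unlock(&q_lock);

    for (volatile int i = 0; i < 1000; i++)
        ;   /* pretend to reset the hardware */

    pthread_mutex_lock(&q_lock);
    in_flight--;
    pthread_mutex_unlock(&q_lock);
    return 0;
}

static void *reset_worker(void *arg)    /* the reset domain's single wq thread */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!q_head)
            pthread_cond_wait(&q_cv, &q_lock);
        struct recover_work *w = q_head;
        q_head = w->next;
        if (!q_head)
            q_tail = NULL;
        pthread_mutex_unlock(&q_lock);

        w->ret = do_recover();

        pthread_mutex_lock(&q_lock);
        w->done = 1;
        pthread_cond_broadcast(&done_cv);
        pthread_mutex_unlock(&q_lock);
    }
    return NULL;
}

static int gpu_recover(void)   /* stands in for amdgpu_device_gpu_recover() */
{
    struct recover_work w = { .next = NULL, .done = 0, .ret = 0 };

    pthread_mutex_lock(&q_lock);            /* queue_work()  */
    if (q_tail)
        q_tail->next = &w;
    else
        q_head = &w;
    q_tail = &w;
    pthread_cond_signal(&q_cv);
    while (!w.done)                         /* flush_work()  */
        pthread_cond_wait(&done_cv, &q_lock);
    pthread_mutex_unlock(&q_lock);
    return w.ret;                           /* work.ret propagated to caller */
}
```

In this picture a TDR would call do_recover() directly, since it already runs on the worker; every other trigger (sysfs, RAS) would go through gpu_recover() and thereby get serialized behind any reset already in flight.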

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 13:26             ` Lazar, Lijo
@ 2022-01-05 13:41               ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-05 13:41 UTC (permalink / raw)
  To: Lazar, Lijo, Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: horace.chen, Monk.Liu

On 05.01.22 at 14:26, Lazar, Lijo wrote:
> On 1/5/2022 6:45 PM, Christian König wrote:
>> On 05.01.22 at 14:11, Lazar, Lijo wrote:
>>> On 1/5/2022 6:01 PM, Christian König wrote:
>>>> On 05.01.22 at 10:54, Lazar, Lijo wrote:
>>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>>> Use reset domain wq also for non TDR gpu recovery triggers
>>>>>> such as sysfs and RAS. We must serialize all possible
>>>>>> GPU recoveries to guarantee no concurrency there.
>>>>>> For TDR call the original recovery function directly since
>>>>>> it's already executed from within the wq. For others just
>>>>>> use a wrapper to queue work and wait on it to finish.
>>>>>>
>>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>>>>> +++++++++++++++++++++-
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>>>>> amdgpu_device *adev);
>>>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>                     struct amdgpu_job* job);
>>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>> +                  struct amdgpu_job *job);
>>>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>    * Returns 0 for success or an error on failure.
>>>>>>    */
>>>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>>                     struct amdgpu_job *job)
>>>>>>   {
>>>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev,
>>>>>>       return r;
>>>>>>   }
>>>>>>   +struct amdgpu_recover_work_struct {
>>>>>> +    struct work_struct base;
>>>>>> +    struct amdgpu_device *adev;
>>>>>> +    struct amdgpu_job *job;
>>>>>> +    int ret;
>>>>>> +};
>>>>>> +
>>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct 
>>>>>> work_struct *work)
>>>>>> +{
>>>>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>>>> +
>>>>>> +    recover_work->ret = 
>>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev, 
>>>>>> recover_work->job);
>>>>>> +}
>>>>>> +/*
>>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>>> + */
>>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>> +                    struct amdgpu_job *job)
>>>>>> +{
>>>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job 
>>>>>> = job};
>>>>>> +
>>>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>>> +
>>>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>>> +        return -EAGAIN;
>>>>>> +
>>>>>
>>>>> The decision to schedule a reset is made at this point. Subsequent 
>>>>> accesses to hardware may not be reliable. So should the flag 
>>>>> in_reset be set here itself rather than waiting for the work to 
>>>>> start execution?
>>>>
>>>> No, when we race and lose, the VM is completely lost and probably 
>>>> restarted by the hypervisor.
>>>>
>>>> And when we race and win, we properly set the flag before signaling 
>>>> the hypervisor that it can continue with the reset.
>>>>
>>>
>>> I was talking about the baremetal case. When this was synchronous, the 
>>> in_reset flag was set as one of the first things, and amdgpu_in_reset 
>>> was checked to prevent further hardware accesses. This design only 
>>> changes the recover part and doesn't change the hardware perspective. 
>>
>>> Potential accesses from other processes need to be blocked as soon 
>>> as we determine a reset is required.
>>
>> That's an incorrect assumption.
>>
>> Accessing the hardware is perfectly ok as long as the reset hasn't 
>> started yet. In other words even when the hardware is locked up you 
>> can still happily read/write registers or access the VRAM BAR.
>>
>
> Not sure if that is 100% correct, e.g. for a recovery triggered by a RAS 
> error (it depends on the access done).

Yeah, for RAS there should just be one error triggered as far as I know. 
Otherwise we have a problem anyway, because there can be any number of 
hardware accesses between the RAS interrupt and setting the in_reset flag.

There are some cases where we shouldn't access the hardware any more, 
e.g. we had cases of static discharge with external mining enclosures.

But in those cases the hardware is so severely damaged that the user 
should either replace it completely or at least power cycle the system.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>> Only when the hardware is currently performing a reset, then we can't 
>> touch it or there might be unfortunate consequences (usually complete 
>> system lockup).
>>
>> Regards,
>> Christian.
>>
>>> Are we expecting the work to be immediately executed and set the flags?
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>> Also, what about having the reset_active or in_reset flag in the 
>>>>> reset_domain itself?
>>>>
>>>> Off hand that sounds like a good idea.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> +    flush_work(&work.base);
>>>>>> +
>>>>>> +    return work.ret;
>>>>>> +}
>>>>>> +
>>>>>>   /**
>>>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE 
>>>>>> slot
>>>>>>    *
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>>>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>>>       } else {
>>>>>> drm_sched_suspend_timeout(&ring->sched);
>>>>>>           if (amdgpu_sriov_vf(adev))
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 12:31       ` Christian König
@ 2022-01-05 18:11         ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-05 18:11 UTC (permalink / raw)
  To: Christian König, Lazar, Lijo, dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu

On 2022-01-05 7:31 a.m., Christian König wrote:

> On 05.01.22 at 10:54, Lazar, Lijo wrote:
>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>> Use reset domain wq also for non TDR gpu recovery triggers
>>> such as sysfs and RAS. We must serialize all possible
>>> GPU recoveries to guarantee no concurrency there.
>>> For TDR call the original recovery function directly since
>>> it's already executed from within the wq. For others just
>>> use a wrapper to queue work and wait on it to finish.
>>>
>>> v2: Rename to amdgpu_recover_work_struct
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>> +++++++++++++++++++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index b5ff76aae7e0..8e96b9a14452 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>> amdgpu_device *adev);
>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                     struct amdgpu_job* job);
>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>> +                  struct amdgpu_job *job);
>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 7c063fd37389..258ec3c0b2af 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>    * Returns 0 for success or an error on failure.
>>>    */
>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>                     struct amdgpu_job *job)
>>>   {
>>>       struct list_head device_list, *device_list_handle = NULL;
>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>       return r;
>>>   }
>>>   +struct amdgpu_recover_work_struct {
>>> +    struct work_struct base;
>>> +    struct amdgpu_device *adev;
>>> +    struct amdgpu_job *job;
>>> +    int ret;
>>> +};
>>> +
>>> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct 
>>> *work)
>>> +{
>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>> +
>>> +    recover_work->ret = 
>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>> +}
>>> +/*
>>> + * Serialize gpu recover into reset domain single threaded wq
>>> + */
>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> +                    struct amdgpu_job *job)
>>> +{
>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>> job};
>>> +
>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>> +
>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>> +        return -EAGAIN;
>>> +
>>
>> The decision to schedule a reset is made at this point. Subsequent 
>> accesses to hardware may not be reliable. So should the flag in_reset 
>> be set here itself rather than waiting for the work to start execution?
>
> No, when we race and lose, the VM is completely lost and probably 
> restarted by the hypervisor.
>
> And when we race and win, we properly set the flag before signaling the 
> hypervisor that it can continue with the reset.
>
>> Also, what about having the reset_active or in_reset flag in the 
>> reset_domain itself?
>
> Off hand that sounds like a good idea.


What then about the adev->reset_sem semaphore? Should we also move it 
into reset_domain? Both of the moves have functional implications only 
for the XGMI case, because there will be contention over accessing those 
single-instance variables from multiple devices, while now each device 
has its own copy.

What benefit does the centralization into reset_domain give? Is it, for 
example, to prevent one device in a hive from accessing another one's 
VRAM (shared FB memory) through MMIO while the other one goes through 
reset?

Andrey


>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Lijo
>>
>>> +    flush_work(&work.base);
>>> +
>>> +    return work.ret;
>>> +}
>>> +
>>>   /**
>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>    *
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>       } else {
>>>           drm_sched_suspend_timeout(&ring->sched);
>>>           if (amdgpu_sriov_vf(adev))
>>>
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-05  7:59                               ` Christian König
@ 2022-01-05 18:24                                 ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-05 18:24 UTC (permalink / raw)
  To: Christian König, JingWen Chen, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022-01-05 2:59 a.m., Christian König wrote:
> On 05.01.22 at 08:34, JingWen Chen wrote:
>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>> Then we have a major design issue in the SRIOV protocol and really 
>>>> need to question this.
>>>>
>>>> How do you want to prevent a race between the hypervisor resetting 
>>>> the hardware and the client trying the same because of a timeout?
>>>>
>>>> As far as I can see the procedure should be:
>>>> 1. We detect that a reset is necessary, either because of a fault a 
>>>> timeout or signal from hypervisor.
>>>> 2. For each of those potential reset sources a work item is send to 
>>>> the single workqueue.
>>>> 3. One of those work items execute first and prepares the reset.
>>>> 4. We either do the reset our self or notify the hypervisor that we 
>>>> are ready for the reset.
>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>> 6. Cancel work items which might have been scheduled from other 
>>>> reset sources.
>>>>
>>>> It does make sense that the hypervisor resets the hardware without 
>>>> waiting for the clients for too long, but if we don't follow this 
>>>> general steps we will always have a race between the different 
>>>> components.
>>>
>>> Monk, just to add to this - if indeed, as you say, 'FLR from 
>>> hypervisor is just to notify guest the hw VF FLR is about to start 
>>> or was already executed, but host will do FLR anyway without waiting 
>>> for guest too long'
>>> and there is no strict waiting from the hypervisor for 
>>> IDH_READY_TO_RESET to be received from the guest before starting the 
>>> reset, then setting in_gpu_reset and locking reset_sem from the guest 
>>> side is not really foolproof
>>> protection from MMIO accesses by the guest - it only truly helps if 
>>> the hypervisor waits for that message before initiating the HW reset.
>>>
>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and 
>> never has the chance to send the response back, then the other VFs will 
>> have to wait for its reset. All the VFs will hang in this case. Or 
>> sometimes the mailbox has some delay and the other VFs will also wait. 
>> The users of the other VFs will be affected in this case.
>
> Yeah, agree completely with JingWen. The hypervisor is the one in 
> charge here, not the guest.
>
> What the hypervisor should do (and it already seems to be designed 
> that way) is to send the guest a message that a reset is about to 
> happen and give it some time to response appropriately.
>
> The guest on the other hand then tells the hypervisor that all 
> processing has stopped and it is ready to restart. If that doesn't 
> happen in time the hypervisor should eliminate the guest and probably 
> trigger even more severe consequences, e.g. restart the whole VM etc...
>
> Christian.


So what's the end conclusion here regarding dropping this particular 
patch ? Seems to me we still need to drop it to prevent driver's MMIO access
to the GPU during reset from various places in the code.

Andrey


>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>> implementation is severely broken and what Andrey is doing is 
>>>>>>> actually fixing it.
>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>
>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the 
>>>>> cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
>>>>> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
>>>>> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; 
>>>>> Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
>>>>> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
>>>>> Chen, JingWen <JingWen.Chen2@amd.com>
>>>>> Cc: daniel@ffwll.ch
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>> protection for SRIOV
>>>>>
>>>>> Hi Jingwen,
>>>>>
>>>>> well what I mean is that we need to adjust the implementation in 
>>>>> amdgpu to actually match the requirements.
>>>>>
>>>>> Could be that the reset sequence is questionable in general, but I 
>>>>> doubt so at least for now.
>>>>>
>>>>> See the FLR request from the hypervisor is just another source of 
>>>>> signaling the need for a reset, similar to each job timeout on 
>>>>> each queue. Otherwise you have a race condition between the 
>>>>> hypervisor and the scheduler.
>>>>>
>>>>> Properly setting in_gpu_reset is indeed mandatory, but should 
>>>>> happen at a central place and not in the SRIOV specific code.
>>>>>
>>>>> In other words I strongly think that the current SRIOV reset 
>>>>> implementation is severely broken and what Andrey is doing is 
>>>>> actually fixing it.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>> Hi Christian,
>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the 
>>>>>> driver".
>>>>>>
>>>>>> Do you mean we should change the reset sequence in SRIOV? This 
>>>>>> will be a huge change for our SRIOV solution.
>>>>>>
>>>>>>    From my point of view, we can directly use 
>>>>>> amdgpu_device_lock_adev
>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock 
>>>>>> since no one will conflict with this thread with reset_domain 
>>>>>> introduced.
>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep 
>>>>>> device untouched via user space.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jingwen Chen
>>>>>>
>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>>>>> procedure.
>>>>>>>
>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not 
>>>>>>> the driver.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs 
>>>>>>>>> first, and do the flr, guest side thread may not know this and 
>>>>>>>>> still try to access HW(e.g. kfd is using a lot of 
>>>>>>>>> amdgpu_in_reset and reset_sem to identify the reset status). 
>>>>>>>>> And this may lead to very bad result.
>>>>>>>>>
>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>> These patches look good to me. JingWen will pull these 
>>>>>>>>>> patches and do some basic TDR test on sriov environment, and 
>>>>>>>>>> give feedback.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- 
>>>>>>>>>>> gfx@lists.freedesktop.org;
>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>
>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there 
>>>>>>>>>>>> is no
>>>>>>>>>>>> need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-05 18:24                                 ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-05 18:24 UTC (permalink / raw)
  To: Christian König, JingWen Chen, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace
  Cc: daniel


On 2022-01-05 2:59 a.m., Christian König wrote:
> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>> Then we have a major design issue in the SRIOV protocol and really 
>>>> need to question this.
>>>>
>>>> How do you want to prevent a race between the hypervisor resetting 
>>>> the hardware and the client trying the same because of a timeout?
>>>>
>>>> As far as I can see the procedure should be:
>>>> 1. We detect that a reset is necessary, either because of a fault a 
>>>> timeout or signal from hypervisor.
>>>> 2. For each of those potential reset sources a work item is send to 
>>>> the single workqueue.
>>>> 3. One of those work items execute first and prepares the reset.
>>>> 4. We either do the reset our self or notify the hypervisor that we 
>>>> are ready for the reset.
>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>> 6. Cancel work items which might have been scheduled from other 
>>>> reset sources.
>>>>
>>>> It does make sense that the hypervisor resets the hardware without 
>>>> waiting for the clients for too long, but if we don't follow this 
>>>> general steps we will always have a race between the different 
>>>> components.
>>>
>>> Monk, just to add to this - if indeed as you say that 'FLR from 
>>> hypervisor is just to notify guest the hw VF FLR is about to start 
>>> or was already executed, but host will do FLR anyway without waiting 
>>> for guest too long'
>>> and there is no strict waiting from the hypervisor for 
>>> IDH_READY_TO_RESET to be recived from guest before starting the 
>>> reset then setting in_gpu_reset and locking reset_sem from guest 
>>> side is not really full proof
>>> protection from MMIO accesses by the guest - it only truly helps if 
>>> hypervisor waits for that message before initiation of HW reset.
>>>
>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and 
>> never has the chance to send the response back, then other VFs will 
>> have to wait it reset. All the vfs will hang in this case. Or 
>> sometimes the mailbox has some delay and other VFs will also wait. 
>> The user of other VFs will be affected in this case.
>
> Yeah, agree completely with JingWen. The hypervisor is the one in 
> charge here, not the guest.
>
> What the hypervisor should do (and it already seems to be designed 
> that way) is to send the guest a message that a reset is about to 
> happen and give it some time to response appropriately.
>
> The guest on the other hand then tells the hypervisor that all 
> processing has stopped and it is ready to restart. If that doesn't 
> happen in time the hypervisor should eliminate the guest probably 
> trigger even more severe consequences, e.g. restart the whole VM etc...
>
> Christian.


So what's the end conclusion here regarding dropping this particular
patch? It seems to me we still need to drop it to prevent the driver's MMIO
accesses to the GPU during reset from various places in the code.

Andrey


>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>> implementation is severely broken and what Andrey is doing is 
>>>>>>> actually fixing it.
>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>
>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the 
>>>>> cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
>>>>> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
>>>>> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; 
>>>>> Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
>>>>> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
>>>>> Chen, JingWen <JingWen.Chen2@amd.com>
>>>>> Cc: daniel@ffwll.ch
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>> protection for SRIOV
>>>>>
>>>>> Hi Jingwen,
>>>>>
>>>>> well what I mean is that we need to adjust the implementation in 
>>>>> amdgpu to actually match the requirements.
>>>>>
>>>>> Could be that the reset sequence is questionable in general, but I 
>>>>> doubt so at least for now.
>>>>>
>>>>> See the FLR request from the hypervisor is just another source of 
>>>>> signaling the need for a reset, similar to each job timeout on 
>>>>> each queue. Otherwise you have a race condition between the 
>>>>> hypervisor and the scheduler.
>>>>>
>>>>> Properly setting in_gpu_reset is indeed mandatory, but should 
>>>>> happen at a central place and not in the SRIOV specific code.
>>>>>
>>>>> In other words I strongly think that the current SRIOV reset 
>>>>> implementation is severely broken and what Andrey is doing is 
>>>>> actually fixing it.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>> Hi Christian,
>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the 
>>>>>> driver".
>>>>>>
>>>>>> Do you mean we should change the reset sequence in SRIOV? This 
>>>>>> will be a huge change for our SRIOV solution.
>>>>>>
>>>>>>    From my point of view, we can directly use 
>>>>>> amdgpu_device_lock_adev
>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock 
>>>>>> since no one will conflict with this thread with reset_domain 
>>>>>> introduced.
>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep 
>>>>>> device untouched via user space.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jingwen Chen
>>>>>>
>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>>>>> procedure.
>>>>>>>
>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not 
>>>>>>> the driver.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs 
>>>>>>>>> first, and do the flr, guest side thread may not know this and 
>>>>>>>>> still try to access HW(e.g. kfd is using a lot of 
>>>>>>>>> amdgpu_in_reset and reset_sem to identify the reset status). 
>>>>>>>>> And this may lead to very bad result.
>>>>>>>>>
>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>> These patches look good to me. JingWen will pull these 
>>>>>>>>>> patches and do some basic TDR test on sriov environment, and 
>>>>>>>>>> give feedback.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- 
>>>>>>>>>>> gfx@lists.freedesktop.org;
>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>
>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there 
>>>>>>>>>>>> is no
>>>>>>>>>>>> need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-05 18:24                                 ` Andrey Grodzovsky
@ 2022-01-06  4:59                                   ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-06  4:59 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>
> On 2022-01-05 2:59 a.m., Christian König wrote:
>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>
>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>
>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>
>>>>> As far as I can see the procedure should be:
>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>> 3. One of those work items execute first and prepares the reset.
>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>
>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>>>>
>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be recived from guest before starting the reset then setting in_gpu_reset and locking reset_sem from guest side is not really full proof
>>>> protection from MMIO accesses by the guest - it only truly helps if hypervisor waits for that message before initiation of HW reset.
>>>>
>>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and never has the chance to send the response back, then other VFs will have to wait it reset. All the vfs will hang in this case. Or sometimes the mailbox has some delay and other VFs will also wait. The user of other VFs will be affected in this case.
>>
>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>
>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>
>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>
>> Christian.
>
>
> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
> to the GPU during reset from various places in the code.
>
> Andrey
>
Hi Andrey & Christian,

I have ported your patch (dropping the reset_sem and in_gpu_reset handling in the flr work) and run some tests. If an engine hangs during an OCL benchmark (using KFD), we see the logs below:

[  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  428.400582] [drm] clean up the vf2pf work item
[  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
[  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
[  437.531392] amdgpu: qcm fence wait loop timeout expired
[  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
[  438.087443] [drm] RE-INIT-early: nv_common succeeded

As KFD relies on these to check whether the GPU is in reset, dropping them makes it very easy to hit page faults and fence errors.

>
>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>> It makes the code crash ... how could that be a fix?
>>>>>>
>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>> Cc: daniel@ffwll.ch
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>
>>>>>> Hi Jingwen,
>>>>>>
>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>
>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>
>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>
>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>> Hi Christian,
>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>
>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>
>>>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since nothing will conflict with this thread once reset_domain is introduced.
>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jingwen Chen
>>>>>>>
>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>
>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>
>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>
>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- 
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>>


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-06  4:59                                   ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-06  4:59 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace
  Cc: daniel


On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>
> On 2022-01-05 2:59 a.m., Christian König wrote:
>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or has already been executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>
>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>
>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>
>>>>> As far as I can see the procedure should be:
>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>> 3. One of those work items executes first and prepares the reset.
>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>> 5. Clean up after the reset, eventually resubmitting jobs, etc.
>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>
>>>>> It does make sense that the hypervisor resets the hardware without waiting too long for the clients, but if we don't follow these general steps we will always have a race between the different components.
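The steps above can be sketched as a minimal model (hypothetical names, not the actual amdgpu reset_domain API) whose only point is the ordering property: all reset sources funnel into one ordered queue, the first queued item performs the reset, and the rest are cancelled:

```c
#include <assert.h>

/* Illustrative model of a single ordered reset queue. Names are
 * hypothetical; the real driver uses an ordered workqueue. */

enum reset_src { SRC_JOB_TIMEOUT, SRC_RAS, SRC_HYPERVISOR_FLR };

struct reset_domain_model {
        enum reset_src pending[8];      /* single ordered workqueue */
        int head, tail;
        int resets_done;
        enum reset_src handled_by;
};

static void reset_queue(struct reset_domain_model *d, enum reset_src s)
{
        d->pending[d->tail++ & 7] = s;  /* step 2: queue a work item */
}

static void reset_flush(struct reset_domain_model *d)
{
        if (d->head == d->tail)
                return;
        /* Steps 3-5: the first work item wins and performs the reset. */
        d->handled_by = d->pending[d->head & 7];
        d->resets_done++;
        /* Step 6: cancel the items queued by the other sources. */
        d->head = d->tail;
}
```

Whichever source signals first (a job timeout or the hypervisor's FLR notification) performs the single reset; the race between them collapses into queue order.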
>>>>
>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long',
>>>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>
>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset. All the VFs will hang in this case. Or sometimes the mailbox has some delay and other VFs will also wait. The users of the other VFs will be affected in this case.
>>
>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>
>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>
>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>
>> Christian.
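The timeout-bounded handshake Christian describes can be modeled in a short sketch (all names hypothetical; the real PF/VF mailbox protocol is more involved): the hypervisor notifies the guest, polls a bounded number of times for a ready acknowledgement, and resets anyway, escalating, if the guest never answers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bounded handshake; not the real
 * PF/VF mailbox API. */
enum guest_state { GUEST_RUNNING, GUEST_READY_TO_RESET };

static int ready_countdown;     /* polls until the guest answers */

static enum guest_state poll_guest(void)
{
        return (ready_countdown-- <= 0) ? GUEST_READY_TO_RESET
                                        : GUEST_RUNNING;
}

/* Returns true if the hypervisor had to escalate (guest timed out). */
static bool hypervisor_flr(int timeout_polls)
{
        for (int i = 0; i < timeout_polls; i++)
                if (poll_guest() == GUEST_READY_TO_RESET)
                        return false;   /* clean reset: guest acked */
        return true;    /* reset anyway, then e.g. restart the VM */
}
```

The key design point is that the wait is bounded: one hung guest cannot stall the FLR for the other VFs.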
>
>
> So what's the end conclusion here regarding dropping this particular patch? It seems to me we still need to drop it to prevent the driver's MMIO accesses
> to the GPU during reset from various places in the code.
>
> Andrey
>
Hi Andrey & Christian,

I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OpenCL benchmark (using KFD), we can see the logs below:

[  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  428.400582] [drm] clean up the vf2pf work item
[  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
[  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
[  437.531392] amdgpu: qcm fence wait loop timeout expired
[  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
[  438.087443] [drm] RE-INIT-early: nv_common succeeded

As KFD relies on these to check whether the GPU is in reset, dropping them makes page faults and fence errors very easy to hit.

>
>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or has already been executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>> It makes the code crash ... how could that be a fix?
>>>>>>
>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>> Cc: daniel@ffwll.ch
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>
>>>>>> Hi Jingwen,
>>>>>>
>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>
>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>
>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>
>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>
>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>> Hi Christian,
>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>
>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>
>>>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since nothing will conflict with this thread once reset_domain is introduced.
>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jingwen Chen
>>>>>>>
>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>
>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>
>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>
>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- 
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>>


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-06  4:59                                   ` JingWen Chen
@ 2022-01-06  5:18                                     ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-06  5:18 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022/1/6 下午12:59, JingWen Chen wrote:
> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or has already been executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>
>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>
>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>
>>>>>> As far as I can see the procedure should be:
>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>> 5. Clean up after the reset, eventually resubmitting jobs, etc.
>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>
>>>>>> It does make sense that the hypervisor resets the hardware without waiting too long for the clients, but if we don't follow these general steps we will always have a race between the different components.
>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long',
>>>>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>
>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset. All the VFs will hang in this case. Or sometimes the mailbox has some delay and other VFs will also wait. The users of the other VFs will be affected in this case.
>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>
>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>
>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>
>>> Christian.
>>
>> So what's the end conclusion here regarding dropping this particular patch? It seems to me we still need to drop it to prevent the driver's MMIO accesses
>> to the GPU during reset from various places in the code.
>>
>> Andrey
>>
> Hi Andrey & Christian,
>
> I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OpenCL benchmark (using KFD), we can see the logs below:
>
> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  428.400582] [drm] clean up the vf2pf work item
> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
> [  437.531392] amdgpu: qcm fence wait loop timeout expired
> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>
> As KFD relies on these to check whether the GPU is in reset, dropping them makes page faults and fence errors very easy to hit.
To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when dropping them.
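That residual window is a classic check-then-use race, which a tiny model makes visible (hypothetical names, not the driver's actual code path): if the reset can begin between the flag check and the hardware access, one MMIO access still slips through.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the residual race: even with an in_gpu_reset-style flag,
 * a reset landing between the check and the access lets one MMIO
 * access through. Hypothetical names. */
static atomic_int reset_flag;

/* reset_starts_after_check simulates the FLR landing in the window
 * between the flag check and the hardware access. Returns true if
 * the MMIO access went through. */
static bool guarded_mmio_access(bool reset_starts_after_check)
{
        if (atomic_load(&reset_flag))
                return false;           /* flag caught it: skip MMIO */
        if (reset_starts_after_check)
                atomic_store(&reset_flag, 1);   /* FLR begins now... */
        /* ...but this thread already passed the check, so the access
         * still happens against hardware that is being reset. */
        return true;
}
```

So the flag narrows the window but cannot close it by itself; only serializing the reset sources (and the hypervisor-side handshake ordering) can.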
>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>
>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>
>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or JingWen can try it if needed.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -------------------------------------------------------------------
>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> -------------------------------------------------------------------
>>>>>>> we are hiring software manager for CVS core team
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>
>>>>>>> Hi Jingwen,
>>>>>>>
>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>
>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>
>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>> Hi Christian,
>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>
>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>
>>>>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since with reset_domain introduced no other thread will conflict with this one.
>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Jingwen Chen
>>>>>>>>
>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>
>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>
>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> Best wishes
>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-06  5:18                                     ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-06  5:18 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace
  Cc: daniel


On 2022/1/6 12:59 PM, JingWen Chen wrote:
> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>
>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>
>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>
>>>>>> As far as I can see the procedure should be:
>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>
>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>
>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, the other VFs will have to wait for its reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>
>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>>
>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>
>>> Christian.
>>
>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>> to the GPU during reset from various places in the code.
>>
>> Andrey
>>
> Hi Andrey & Christian,
>
> I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>
> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
> [  428.400582] [drm] clean up the vf2pf work item
> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
> [  437.531392] amdgpu: qcm fence wait loop timeout expired
> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>
> As kfd relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
To be clear, we can also hit the page fault with reset_sem and in_gpu_reset in place, just not as easily as when dropping them.
>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>
>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>
>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or JingWen can try it if needed.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -------------------------------------------------------------------
>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> -------------------------------------------------------------------
>>>>>>> we are hiring software manager for CVS core team
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>
>>>>>>> Hi Jingwen,
>>>>>>>
>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>
>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>
>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>> Hi Christian,
>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>
>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>
>>>>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since with reset_domain introduced no other thread will conflict with this one.
>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Jingwen Chen
>>>>>>>>
>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>
>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>
>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> Best wishes
>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-06  5:18                                     ` JingWen Chen
@ 2022-01-06  9:13                                       ` Christian König
  -1 siblings, 0 replies; 103+ messages in thread
From: Christian König @ 2022-01-06  9:13 UTC (permalink / raw)
  To: JingWen Chen, Andrey Grodzovsky, Christian König, Liu, Monk,
	Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen, Horace

On 06.01.22 at 06:18, JingWen Chen wrote:
> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>>
>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>
>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>
>>>>>>> As far as I can see the procedure should be:
>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>
>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>
>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, the other VFs will have to wait for its reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>
>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>>>
>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>
>>>> Christian.
>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>> to the GPU during reset from various places in the code.
>>>
>>> Andrey
>>>
>> Hi Andrey & Christian,
>>
>> I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>>
>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  428.400582] [drm] clean up the vf2pf work item
>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>
>> As KFD relies on these to check whether the GPU is in reset, dropping them makes it very easy to hit page faults and fence errors.
> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when dropping them.

Yeah, I was just about to complain that this isn't good engineering, but
just makes things less likely.

The question is what approach would avoid those kinds of problems
from the start?

Regards,
Christian.

>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest
>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>>
>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> Hi Jingwen,
>>>>>>>>
>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>
>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>
>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>> Hi Christian,
>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>
>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>
>>>>>>>>>     From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jingwen Chen
>>>>>>>>>
>>>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>
>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, a guest-side thread may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>> Since the flr work is now serialized against GPU resets there is no
>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-06  5:18                                     ` JingWen Chen
@ 2022-01-06 19:13                                       ` Andrey Grodzovsky
  -1 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-06 19:13 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022-01-06 12:18 a.m., JingWen Chen wrote:
> On 2022/1/6 下午12:59, JingWen Chen wrote:
>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest
>>>>>>>>
>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>
>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>
>>>>>>> As far as I can see the procedure should be:
>>>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>
>>>>>>> It does make sense that the hypervisor resets the hardware without waiting too long for the clients, but if we don't follow these general steps we will always have a race between the different components.
>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>
>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset. All the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>
>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>>>
>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>
>>>> Christian.
>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>> to the GPU during reset from various places in the code.
>>>
>>> Andrey
>>>
>> Hi Andrey & Christian,
>>
>> I have ported your patch(drop the reset_sem and in_gpu_reset in flr work) and run some tests. If a engine hang during an OCL benchmark(using kfd), we can see the logs below:


Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent 
GPU reset protection for SRIOV' ?


>>
>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  428.400582] [drm] clean up the vf2pf work item
>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>
>> As kfd relies on these to check if GPU is in reset, dropping it will hit some page fault and fence error very easily.
> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset, just not as easily as dropping them.

Are you saying that the entire patch-set with and without patch 
'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
is casing this GPUVM page fault during testing engine hang while running 
benchmark ?

Do you never observe this page fault when running this test with 
original tree without the new patch-set ?

Andrey


>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>>>>
>>>>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> Hi Jingwen,
>>>>>>>>
>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>
>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>
>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>> Hi Christian,
>>>>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>>>>>>>
>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>
>>>>>>>>>     From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jingwen Chen
>>>>>>>>>
>>>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>
>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-06 19:13                                       ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-06 19:13 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace
  Cc: daniel


On 2022-01-06 12:18 a.m., JingWen Chen wrote:
> On 2022/1/6 下午12:59, JingWen Chen wrote:
>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>
>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>
>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>
>>>>>>> As far as I can see the procedure should be:
>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>> 5. Clean up after the reset, eventually resubmitting jobs etc.
>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>
>>>>>>> It does make sense that the hypervisor resets the hardware without waiting too long for the clients, but if we don't follow these general steps we will always have a race between the different components.
>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long',
>>>>>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem on the guest side is not really foolproof
>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>
>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset, and all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait; the users of the other VFs will be affected in this case.
>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>
>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>
>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc.
>>>>
>>>> Christian.
>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>> to the GPU during reset from various places in the code.
>>>
>>> Andrey
>>>
>> Hi Andrey & Christian,
>>
>> I have ported your patch (dropping the reset_sem and in_gpu_reset in flr work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:


Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent 
GPU reset protection for SRIOV' ?


>>
>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>> [  428.400582] [drm] clean up the vf2pf work item
>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>
>> As kfd relies on these to check whether the GPU is in reset, dropping them makes it very easy to hit page faults and fence errors.
> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when they are dropped.

Are you saying that the entire patch-set, both with and without the patch 
'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
is causing this GPUVM page fault when testing an engine hang while 
running the benchmark?

Did you ever observe this page fault when running this test on the 
original tree, without the new patch-set?

Andrey


>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>> It makes the code crash... how could it be a fix?
>>>>>>>>
>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> Hi Jingwen,
>>>>>>>>
>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>
>>>>>>>> It could be that the reset sequence is questionable in general, but I doubt that, at least for now.
>>>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>
>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>> Hi Christian,
>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>
>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>
>>>>>>>>>     From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since with reset_domain introduced no one will conflict with this thread.
>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jingwen Chen
>>>>>>>>>
>>>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>
>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access the HW (e.g. kfd uses amdgpu_in_reset and reset_sem in many places to identify the reset status), and this may lead to very bad results.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>> Since the flr work is now serialized against GPU resets, there is
>>>>>>>>>>>>>>> no need for this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-06 19:13                                       ` Andrey Grodzovsky
@ 2022-01-07  3:57                                         ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-07  3:57 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace


On 2022/1/7 上午3:13, Andrey Grodzovsky wrote:
>
> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>> On 2022/1/6 下午12:59, JingWen Chen wrote:
>>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest
>>>>>>>>>
>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>
>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>
>>>>>>>> As far as I can see the procedure should be:
>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>> 5. Cleanup after the reset, possibly resubmit jobs, etc.
>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>
>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
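The serialized-workqueue behavior those steps describe can be sketched in plain user-space C (a hypothetical model, not amdgpu code — `reset_work_run`, `run_scenario`, and the generation counter are invented for illustration):

```c
/* User-space model (hypothetical, not amdgpu code) of the procedure
 * above: every reset source queues onto one ordered queue, whichever
 * work item runs first performs the reset, and later work items for
 * the same hang see the bumped generation and become no-ops (step 6's
 * "cancel work items which might have been scheduled from other
 * reset sources"). */
#include <pthread.h>

static pthread_mutex_t reset_lock = PTHREAD_MUTEX_INITIALIZER;
static int reset_generation;   /* bumped once per actual HW reset */

/* One queued reset work item: it remembers which generation it saw
 * when the hang was detected.  Returns 1 if it performed the reset. */
static int reset_work_run(int seen_generation)
{
    int did_reset = 0;

    pthread_mutex_lock(&reset_lock);
    if (seen_generation == reset_generation) {
        reset_generation++;        /* "perform" the reset */
        did_reset = 1;
    }                              /* else: already handled, no-op */
    pthread_mutex_unlock(&reset_lock);
    return did_reset;
}

/* TDR, hypervisor FLR, and sysfs all detect the same hang (same
 * generation) and queue work; the ordered queue runs them one by
 * one, so exactly one reset actually happens. */
static int run_scenario(void)
{
    int gen = reset_generation;

    return reset_work_run(gen) + reset_work_run(gen) +
           reset_work_run(gen);
}
```

Step 6's cancellation falls out naturally here: a later work item for the same hang sees the bumped generation and does nothing.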
>>>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem on the guest side is not really foolproof
>>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>
>>>>>> Hi Andrey, this cannot be done. If the guest kernel somehow hangs and never has the chance to send the response back, the other VFs will have to wait for its reset, and all the VFs will hang in this case. Sometimes the mailbox also has some delay and the other VFs will wait as well; the users of the other VFs will be affected in this case.
>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>
>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>
>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>
>>>>> Christian.
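A minimal sketch of the handshake Christian describes — the hypervisor announces the FLR, waits a bounded time for the guest's ready message, then resets regardless. This is a hypothetical model, not the actual host/guest mailbox protocol; `hypervisor_flr` and both outcome names are invented:

```c
/* Hypothetical model of the bounded-wait handshake: the hypervisor
 * gives the guest a limited window to quiesce, but the FLR proceeds
 * either way, so one hung guest cannot stall the other VFs. */
#include <stdbool.h>

enum hv_outcome { RESET_AFTER_ACK, RESET_AFTER_TIMEOUT };

/* guest_ack_delay_ms < 0 models a guest that never responds. */
static enum hv_outcome hypervisor_flr(int guest_ack_delay_ms,
                                      int timeout_ms)
{
    bool acked = guest_ack_delay_ms >= 0 &&
                 guest_ack_delay_ms <= timeout_ms;

    /* The FLR happens in both branches; the outcome only records
     * whether the guest managed to quiesce within the window. */
    return acked ? RESET_AFTER_ACK : RESET_AFTER_TIMEOUT;
}
```

The design choice being debated in the thread is exactly the timeout branch: because the reset proceeds without the ack, any guest-side locking alone cannot fully fence off MMIO during the reset.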
>>>> So what's the end conclusion here regarding dropping this particular patch? Seems to me we still need to drop it to prevent the driver's MMIO access
>>>> to the GPU during reset from various places in the code.
>>>>
>>>> Andrey
>>>>
>>> Hi Andrey & Christian,
>>>
>>> I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OCL benchmark (using KFD), we can see the logs below:
>
>
> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>
>
I ported the entire patchset
>>>
>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  428.400582] [drm] clean up the vf2pf work item
>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>
>>> As KFD relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
>> To be clear, we can also hit the page fault with reset_sem and in_gpu_reset in place, just not as easily as when dropping them.
>
> Are you saying that the entire patch-set, both with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
> is causing this GPUVM page fault when testing an engine hang while running the benchmark?
>
> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>
> Andrey
>
I think this page fault issue can be seen even on the original tree; it's just that dropping the concurrent GPU reset protection makes it easier to hit.

We may need a new way to protect the reset in SRIOV.

>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting too long for the guest
>>>>>>>>>
>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>> It makes the code crash... how could it be a fix?
>>>>>>>>>
>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>
>>>>>>>>> Hi Jingwen,
>>>>>>>>>
>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>
>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>
>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>
>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>> Hi Christian,
>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>
>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>
>>>>>>>>>>     From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since with reset_domain introduced no one will conflict with this thread.
>>>>>>>>>> But we do need reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Jingwen Chen
>>>>>>>>>>
>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>
>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, guest-side threads may not know this and still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status), and this may lead to very bad results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>> Since the FLR work is now serialized against GPU resets, there is no
>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-07  3:57                                         ` JingWen Chen
@ 2022-01-07  5:46                                           ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-07  5:46 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang


On 2022/1/7 上午11:57, JingWen Chen wrote:
> On 2022/1/7 上午3:13, Andrey Grodzovsky wrote:
>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>> On 2022/1/6 下午12:59, JingWen Chen wrote:
>>>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>>>>>>>
>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>
>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>
>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>>> 5. Clean up after the reset, eventually resubmitting jobs etc.
>>>>>>>>> 6. Cancel work items that might have been scheduled from other reset sources.
>>>>>>>>>
>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>>
>>>>>>> Hi Andrey, this cannot be done. If the guest kernel somehow hangs and never has the chance to send the response back, then the other VFs will have to wait for it to reset, so all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>
>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>
>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>
>>>>>> Christian.
>>>>> So what's the end conclusion here regarding dropping this particular patch? It seems to me we still need to drop it to prevent the driver's MMIO accesses
>>>>> to the GPU during reset from various places in the code.
>>>>>
>>>>> Andrey
>>>>>
>>>> Hi Andrey & Christian,
>>>>
>>>> I have ported your patch (dropping the reset_sem and in_gpu_reset in the FLR work) and run some tests. If an engine hangs during an OCL benchmark (using KFD), we can see the logs below:
>>
>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>
>>
> I ported the entire patchset
>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>
>>>> As KFD relies on these to check whether the GPU is in reset, dropping them makes it very easy to hit page faults and fence errors.
>>> To be clear, we can also hit the page fault with reset_sem and in_gpu_reset in place, just not as easily as when they are dropped.
>> Are you saying that the entire patch-set, both with and without the patch 'drm/amd/virt: Drop concurrent GPU reset
>> protection for SRIOV', is causing this GPUVM page fault when testing an engine hang while running the benchmark?
>>
>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>
>> Andrey
>>
> I think this page fault issue can be seen even on the original tree; it's just that dropping the concurrent GPU reset protection hits it more easily.
>
> We may need a new way to protect the reset in SRIOV.
>
Hi Andrey

Actually, I would like to propose an RFC based on your patch which moves the waiting logic in the SRIOV FLR work into amdgpu_device_gpu_recover_imp: the host will wait a certain time until the pre_reset work is done and the guest sends back a response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.

JingWen

>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not; the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>>>>
>>>>>>>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>
>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>
>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>
>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>>>> Hi Christian,
>>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>>
>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This would be a huge change for our SRIOV solution.
>>>>>>>>>>>
>>>>>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>>>>>>>> But we do need reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>
>>>>>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>
>>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-07  5:46                                           ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-07  5:46 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang
  Cc: daniel


On 2022/1/7 上午11:57, JingWen Chen wrote:
> On 2022/1/7 上午3:13, Andrey Grodzovsky wrote:
>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>> On 2022/1/6 下午12:59, JingWen Chen wrote:
>>>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>
>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>
>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>
>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>>>>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be recived from guest before starting the reset then setting in_gpu_reset and locking reset_sem from guest side is not really full proof
>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if hypervisor waits for that message before initiation of HW reset.
>>>>>>>>
>>>>>>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and never has the chance to send the response back, then other VFs will have to wait it reset. All the vfs will hang in this case. Or sometimes the mailbox has some delay and other VFs will also wait. The user of other VFs will be affected in this case.
>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>
>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>>>>>
>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>
>>>>>> Christian.
>>>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>>>> to the GPU during reset from various places in the code.
>>>>>
>>>>> Andrey
>>>>>
>>>> Hi Andrey & Christian,
>>>>
>>>> I have ported your patch(drop the reset_sem and in_gpu_reset in flr work) and run some tests. If a engine hang during an OCL benchmark(using kfd), we can see the logs below:
>>
>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>
>>
> I ported the entire patchset
>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>
>>>> As kfd relies on these to check if GPU is in reset, dropping it will hit some page fault and fence error very easily.
>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset, just not as easily as dropping them.
>> Are you saying that the entire patch-set with and without patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>> is casing this GPUVM page fault during testing engine hang while running benchmark ?
>>
>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>
>> Andrey
>>
> I think this page fault issue can be seen even on the original tree. It's just drop the concurrent GPU reset will hit it more easily.
>
> We may need a new way to protect the reset in SRIOV.
>
Hi Andrey

Actually, I would like to propose a RFC based on your patch, which will move the waiting logic in SRIOV flr work into amdgpu_device_gpu_recover_imp, host will wait a certain time till the pre_reset work done and guest send back response then actually do the vf flr. Hopefully this will help solving the page fault issue.

JingWen

>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>>>>>>
>>>>>>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>
>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>
>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>
>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>>>> Hi Christian,
>>>>>>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>>
>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>
>>>>>>>>>>>     From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>>>>>>>> But we do need reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>
>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>
>>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>>>> Since FLR work is now serialized against GPU resets, there is no
>>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-07  5:46                                           ` JingWen Chen
@ 2022-01-07 16:02                                             ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-07 16:02 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang


On 2022-01-07 12:46 a.m., JingWen Chen wrote:
> On 2022/1/7 11:57 AM, JingWen Chen wrote:
>> On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>>>> See, the FLR request from the hypervisor is just another source signaling the need for a reset, similar to a job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>> No it's not: the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long.
>>>>>>>>>>>
>>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>>
>>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>>
>>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>>>> 5. Clean up after the reset, eventually resubmitting jobs etc.
>>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>>
>>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long',
>>>>>>>>> and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>>>
>>>>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset, and all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait; the users of the other VFs would be affected in this case.
>>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>>
>>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>>
>>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>>
>>>>>>> Christian.
>>>>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>>>>> to the GPU during reset from various places in the code.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>> Hi Andrey & Christian,
>>>>>
>>>>> I have ported your patch (drop the reset_sem and in_gpu_reset in flr work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>>
>>>
>> I ported the entire patchset
>>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>>
>>>>> As kfd relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
>>>> To be clear, we can also hit the page fault with reset_sem and in_gpu_reset in place, just not as easily as after dropping them.
>>> Are you saying that the entire patch-set, both with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
>>> is causing this GPUVM page fault when testing an engine hang while running the benchmark?
>>>
>>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>>
>>> Andrey
>>>
>> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection will hit it more easily.
>>
>> We may need a new way to protect the reset in SRIOV.
>>
> Hi Andrey
>
> Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV FLR work into amdgpu_device_gpu_recover_imp: the host will wait a certain time until the pre-reset work is done and the guest has sent back its response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.
>
> JingWen


This makes sense to me: you want the guest driver to be as idle as 
possible before the host side starts the actual reset. Go ahead and try it 
on top of my patch-set and update me with the results.
I am away all next week but will try to find time and peek at your updates.

Another question - how much does the switch to single-threaded reset make 
SRIOV more unstable? Is it OK to push the patches as is, without your 
RFC, or do we need to wait for your RFC before pushing?

Andrey


>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>>>> See, the FLR request from the hypervisor is just another source signaling the need for a reset, similar to a job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>> No it's not: the FLR from the hypervisor just notifies the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long.
>>>>>>>>>>>
>>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>>>>>
>>>>>>>>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-07 16:02                                             ` Andrey Grodzovsky
  0 siblings, 0 replies; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-07 16:02 UTC (permalink / raw)
  To: JingWen Chen, Christian König, Christian König, Liu,
	Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang
  Cc: daniel


On 2022-01-07 12:46 a.m., JingWen Chen wrote:
> On 2022/1/7 上午11:57, JingWen Chen wrote:
>> On 2022/1/7 上午3:13, Andrey Grodzovsky wrote:
>>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>>> On 2022/1/6 下午12:59, JingWen Chen wrote:
>>>>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>>
>>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>>
>>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>>
>>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>>>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>>
>>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>>>>>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be recived from guest before starting the reset then setting in_gpu_reset and locking reset_sem from guest side is not really full proof
>>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if hypervisor waits for that message before initiation of HW reset.
>>>>>>>>>
>>>>>>>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and never has the chance to send the response back, then other VFs will have to wait it reset. All the vfs will hang in this case. Or sometimes the mailbox has some delay and other VFs will also wait. The user of other VFs will be affected in this case.
>>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>>
>>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to response appropriately.
>>>>>>>
>>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>>
>>>>>>> Christian.
>>>>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>>>>> to the GPU during reset from various places in the code.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>> Hi Andrey & Christian,
>>>>>
>>>>> I have ported your patch(drop the reset_sem and in_gpu_reset in flr work) and run some tests. If a engine hang during an OCL benchmark(using kfd), we can see the logs below:
>>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>>
>>>
>> I ported the entire patchset
>>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>>
>>>>> As kfd relies on these to check if the GPU is in reset, dropping them will hit some page faults and fence errors very easily.
>>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when they are dropped.
>>> Are you saying that the entire patch-set, with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
>>> is causing this GPUVM page fault when testing an engine hang while running the benchmark ?
>>>
>>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>>
>>> Andrey
>>>
>> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection makes it easier to hit.
>>
>> We may need a new way to protect the reset in SRIOV.
>>
> Hi Andrey
>
> Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV flr work into amdgpu_device_gpu_recover_imp; the host will wait a certain time until the pre_reset work is done and the guest sends back the response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.
>
> JingWen


This makes sense to me - you want the guest driver to be as idle as 
possible before the host side starts the actual reset. Go ahead and try it 
on top of my patch-set and update me with the results.
I am away all next week but will try to find time to peek at your updates.

Another question - how much does the switch to single-threaded reset make 
SRIOV more unstable ? Is it OK to push the patches as is without your 
RFC, or do we need to wait for your RFC before pushing ?

Andrey
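For readers following the thread: the reset_domain idea — every reset source queues a work item on one single-threaded ordered workqueue, and the first item that actually runs cancels the duplicates still pending — can be sketched in plain user-space C. This is an illustrative model only, not amdgpu code; the names (reset_domain, queue_reset, reset_worker) are made up, and it simplifies the cover letter's behavior: in the real patchset only duplicate TDRs are cancelled, while RAS- and sysfs-triggered resets still run after the in-flight reset finishes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_WORK 8

struct reset_work {
	const char *source;	/* "tdr timeout", "ras", "sysfs", ... */
	bool cancelled;
};

/* One domain = one ordered queue; everything that must reset together
 * goes through it, so resets are serialized by construction. */
struct reset_domain {
	pthread_mutex_t lock;
	struct reset_work *queue[MAX_WORK];
	int head, tail;
	int resets_done;
};

static void queue_reset(struct reset_domain *d, struct reset_work *w)
{
	pthread_mutex_lock(&d->lock);
	d->queue[d->tail++ % MAX_WORK] = w;
	pthread_mutex_unlock(&d->lock);
}

/* A single worker drains items in order, so no two resets ever race. */
static void *reset_worker(void *arg)
{
	struct reset_domain *d = arg;

	pthread_mutex_lock(&d->lock);
	while (d->head != d->tail) {
		struct reset_work *w = d->queue[d->head++ % MAX_WORK];

		if (w->cancelled)
			continue;
		printf("reset triggered by %s\n", w->source);
		d->resets_done++;
		/* The first reset to run cancels everything still pending
		 * (simplified: amdgpu only cancels duplicate TDRs here). */
		for (int i = d->head; i != d->tail; i++)
			d->queue[i % MAX_WORK]->cancelled = true;
	}
	pthread_mutex_unlock(&d->lock);
	return NULL;
}
```

With three sources queued for the same hang, only one reset executes and the other two work items retire as cancelled — which is the serialization property the single-threaded queue is meant to buy.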


>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>>>>>>>>
>>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>> It makes the code crash ... how could it be a fix ?
>>>>>>>>>>>
>>>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>>
>>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>>
>>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>
>>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>>
>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>>
>>>>>>>>>>>>      From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>>>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>>
>>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side threads may not know this and still try to access the HW (e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>>         2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>             int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>             int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) ||


* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
  2022-01-07 16:02                                             ` Andrey Grodzovsky
@ 2022-01-12  6:28                                               ` JingWen Chen
  -1 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-12  6:28 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang

Hi Andrey,

Please go ahead and push your change. I will prepare the RFC later.

On 2022/1/8 12:02 AM, Andrey Grodzovsky wrote:
>
> On 2022-01-07 12:46 a.m., JingWen Chen wrote:
>> On 2022/1/7 11:57 AM, JingWen Chen wrote:
>>> On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>>>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>>>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>>>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>>>>>>>>>
>>>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>>>
>>>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>>>>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>>>
>>>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>>>>>>>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiation of the HW reset.
>>>>>>>>>>
>>>>>>>>> Hi Andrey, this cannot be done. If the guest kernel somehow hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset and will all hang in this case. Or sometimes the mailbox has some delay and other VFs will also have to wait. The users of the other VFs will be affected in this case.
>>>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>>>
>>>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>>>
>>>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>>>
>>>>>>>> Christian.
>>>>>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent driver's MMIO access
>>>>>>> to the GPU during reset from various places in the code.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>> Hi Andrey & Christian,
>>>>>>
>>>>>> I have ported your patch (drop the reset_sem and in_gpu_reset in flr work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>>>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>>>
>>>>
>>> I ported the entire patchset
>>>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>>>
>>>>>> As kfd relies on these to check if the GPU is in reset, dropping them will hit some page faults and fence errors very easily.
>>>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when they are dropped.
>>>> Are you saying that the entire patch-set, with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
>>>> is causing this GPUVM page fault when testing an engine hang while running the benchmark ?
>>>>
>>>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>>>
>>>> Andrey
>>>>
>>> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection makes it easier to hit.
>>>
>>> We may need a new way to protect the reset in SRIOV.
>>>
>> Hi Andrey
>>
>> Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV flr work into amdgpu_device_gpu_recover_imp; the host will wait a certain time until the pre_reset work is done and the guest sends back the response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.
>>
>> JingWen
>
>
> This makes sense to me - you want the guest driver to be as idle as possible before the host side starts the actual reset. Go ahead and try it on top of my patch-set and update me with the results.
> I am away all next week but will try to find time to peek at your updates.
>
> Another question - how much does the switch to single-threaded reset make SRIOV more unstable ? Is it OK to push the patches as is without your RFC, or do we need to wait for your RFC before pushing ?
>
> Andrey
>
>
>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>> No it's not; the FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed, but the host will do the FLR anyway without waiting for the guest for too long
>>>>>>>>>>>>
>>>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>>> It makes the code crash ... how could it be a fix ?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>>>
>>>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>>>
>>>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>>>
>>>>>>>>>>>>>      From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>>>>>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side threads may not know this and still try to access the HW (e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>>>         2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>             int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>             int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) ||

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
@ 2022-01-12  6:28                                               ` JingWen Chen
  0 siblings, 0 replies; 103+ messages in thread
From: JingWen Chen @ 2022-01-12  6:28 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Christian König,
	Liu, Monk, Chen, JingWen, Deng, Emily, dri-devel, amd-gfx, Chen,
	Horace, kaili.wang
  Cc: daniel

Hi Andrey,

Please go ahead and push your change. I will prepare the RFC later.

On 2022/1/8 上午12:02, Andrey Grodzovsky wrote:
>
> On 2022-01-07 12:46 a.m., JingWen Chen wrote:
>> On 2022/1/7 上午11:57, JingWen Chen wrote:
>>> On 2022/1/7 上午3:13, Andrey Grodzovsky wrote:
>>>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>>>> On 2022/1/6 下午12:59, JingWen Chen wrote:
>>>>>> On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>>>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>>>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>>>
>>>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>>>
>>>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
>>>>>>>>>>> 2. For each of those potential reset sources a work item is send to the single workqueue.
>>>>>>>>>>> 3. One of those work items execute first and prepares the reset.
>>>>>>>>>>> 4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
>>>>>>>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>>>
>>>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.
>>>>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long'
>>>>>>>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof
>>>>>>>>>> protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>>>>
>>>>>>>>> Hi Andrey, this cannot be done. If the guest kernel somehow hangs and never gets the chance to send the response back, the other VFs will have to wait for it to reset, and all the VFs will hang in this case. Sometimes the mailbox also has some delay, so the other VFs will wait as well; the users of the other VFs will be affected in this case.
>>>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>>>>
>>>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>>>
>>>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>>>
>>>>>>>> Christian.
>>>>>>> So what's the end conclusion here regarding dropping this particular patch ? Seems to me we still need to drop it to prevent the driver's MMIO accesses
>>>>>>> to the GPU during reset from various places in the code.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>> Hi Andrey & Christian,
>>>>>>
>>>>>> I have ported your patch (drop the reset_sem and in_gpu_reset in the flr work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>>>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV' ?
>>>>
>>>>
>>> I ported the entire patchset
>>>>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>>>> [  428.400582] [drm] clean up the vf2pf work item
>>>>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>>>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>>>
>>>>>> As kfd relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
>>>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when dropping them.
>>>> Are you saying that the entire patch-set, both with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV',
>>>> is causing this GPUVM page fault when testing an engine hang while running the benchmark ?
>>>>
>>>> Do you never observe this page fault when running this test with original tree without the new patch-set ?
>>>>
>>>> Andrey
>>>>
>>> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection will hit it more easily.
>>>
>>> We may need a new way to protect the reset in SRIOV.
>>>
>> Hi Andrey
>>
>> Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV flr work into amdgpu_device_gpu_recover_imp: the host will wait a certain time until the pre_reset work is done and the guest sends back its response, and only then actually do the VF FLR. Hopefully this will help solve the page fault issue.
>>
>> JingWen
>
>
> This makes sense to me - you want the guest driver to be as idle as possible before the host side starts the actual reset. Go ahead and try it on top of my patch-set and update with the results.
> I am away all next week but will try to find time and peek at your updates.
>
> Another question - how much does the switch to single-threaded reset make SRIOV more unstable ? Is it OK to push the patches as-is without your RFC, or do we need to wait for your RFC before pushing ?
>
> Andrey
>
>
>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>>>
>>>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>>> It makes the code crash ... how could it be a fix ?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or JingWen can try it if needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>>>> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>
>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>>>
>>>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>>>
>>>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>>>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>>>
>>>>>>>>>>>>>      From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced.
>>>>>>>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched from user space.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>;
>>>>>>>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org;
>>>>>>>>>>>>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>>>>>>>>>>>>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>>>>>>>>>>>>>>>>>> <Horace.Chen@amd.com>
>>>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>>>         2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>             int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>             int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>             } while (timeout > 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>         flr_done:
>>>>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>             /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>             if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>>>                 && (!amdgpu_device_has_job_running(adev) ||


* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-05 18:11         ` Andrey Grodzovsky
  (?)
@ 2022-01-17 19:14         ` Andrey Grodzovsky
  2022-01-17 19:17           ` Christian König
  -1 siblings, 1 reply; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-17 19:14 UTC (permalink / raw)
  To: Christian König, Lazar, Lijo, dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu

[-- Attachment #1: Type: text/plain, Size: 778 bytes --]

Ping on the question

Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
>>> Also, what about having the reset_active or in_reset flag in the 
>>> reset_domain itself?
>>
>> Offhand, that sounds like a good idea.
>
>
> What then about the adev->reset_sem semaphore ? Should we also move 
> this to reset_domain ? Both of the moves have functional
> implications only for the XGMI case, because there will be contention over 
> accessing those single-instance variables from multiple devices,
> while now each device has its own copy.
>
> What benefit does the centralization into reset_domain give - is it, for 
> example, to prevent one device in a hive from accessing another one's 
> VRAM (shared FB memory) through MMIO while the other one goes through 
> reset ?
>
> Andrey 



* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-17 19:14         ` Andrey Grodzovsky
@ 2022-01-17 19:17           ` Christian König
  2022-01-17 19:21             ` Andrey Grodzovsky
  0 siblings, 1 reply; 103+ messages in thread
From: Christian König @ 2022-01-17 19:17 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, Lazar, Lijo, dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:
>
> Ping on the question
>

Oh, my! That was already more than a week ago and is completely swapped 
out of my head again.

> Andrey
>
> On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
>>>> Also, what about having the reset_active or in_reset flag in the 
>>>> reset_domain itself?
>>>
>>> Offhand, that sounds like a good idea.
>>
>>
>> What then about the adev->reset_sem semaphore ? Should we also move 
>> this to reset_domain ?  Both of the moves have functional
>> implications only for XGMI case because there will be contention over 
>> accessing those single instance variables from multiple devices
>> while now each device has its own copy.

Since this is a rw semaphore that should be unproblematic I think. It 
could just be that the cache line of the lock then plays ping/pong 
between the CPU cores.

>>
>> What benefit the centralization into reset_domain gives - is it for 
>> example to prevent one device in a hive trying to access through MMIO 
>> another one's
>> VRAM (shared FB memory) while the other one goes through reset ?

I think that this is the killer argument for a centralized lock, yes.

Christian.

>>
>> Andrey 




* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-17 19:17           ` Christian König
@ 2022-01-17 19:21             ` Andrey Grodzovsky
  2022-01-26 15:52               ` Andrey Grodzovsky
  0 siblings, 1 reply; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-17 19:21 UTC (permalink / raw)
  To: Christian König, Christian König, Lazar, Lijo,
	dri-devel, amd-gfx
  Cc: horace.chen, Monk.Liu

[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]


On 2022-01-17 2:17 p.m., Christian König wrote:
> Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:
>>
>> Ping on the question
>>
>
> Oh, my! That was already more than a week ago and is completely 
> swapped out of my head again.
>
>> Andrey
>>
>> On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
>>>>> Also, what about having the reset_active or in_reset flag in the 
>>>>> reset_domain itself?
>>>>
>>>> Offhand, that sounds like a good idea.
>>>
>>>
>>> What then about the adev->reset_sem semaphore ? Should we also move 
>>> this to reset_domain ?  Both of the moves have functional
>>> implications only for XGMI case because there will be contention 
>>> over accessing those single instance variables from multiple devices
>>> while now each device has its own copy.
>
> Since this is a rw semaphore that should be unproblematic I think. It 
> could just be that the cache line of the lock then plays ping/pong 
> between the CPU cores.
>
>>>
>>> What benefit the centralization into reset_domain gives - is it for 
>>> example to prevent one device in a hive trying to access through 
>>> MMIO another one's
>>> VRAM (shared FB memory) while the other one goes through reset ?
>
> I think that this is the killer argument for a centralized lock, yes.


No problem, I will add a patch centralizing both flags into the reset domain and 
resend.

Andrey


>
> Christian.
>
>>>
>>> Andrey 
>



* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-17 19:21             ` Andrey Grodzovsky
@ 2022-01-26 15:52               ` Andrey Grodzovsky
  2022-01-28 16:57                 ` Grodzovsky, Andrey
  0 siblings, 1 reply; 103+ messages in thread
From: Andrey Grodzovsky @ 2022-01-26 15:52 UTC (permalink / raw)
  To: Christian König, Christian König, Lazar, Lijo,
	dri-devel, amd-gfx, JingWen Chen
  Cc: horace.chen, Monk.Liu

[-- Attachment #1: Type: text/plain, Size: 1776 bytes --]

JingWen - could you maybe give these patches a try on an SRIOV XGMI system? 
If you see issues, maybe you could let me connect and debug. My SRIOV 
XGMI system, which Shayun kindly arranged for me, is not loading the 
driver with my drm-misc-next branch even without my patches.

Andrey

On 2022-01-17 14:21, Andrey Grodzovsky wrote:
>
>
> On 2022-01-17 2:17 p.m., Christian König wrote:
>> Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:
>>>
>>> Ping on the question
>>>
>>
>> Oh, my! That was already more than a week ago and is completely 
>> swapped out of my head again.
>>
>>> Andrey
>>>
>>> On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
>>>>>> Also, what about having the reset_active or in_reset flag in the 
>>>>>> reset_domain itself?
>>>>>
>>>>> Offhand, that sounds like a good idea.
>>>>
>>>>
>>>> What then about the adev->reset_sem semaphore ? Should we also move 
>>>> this to reset_domain ?  Both of the moves have functional
>>>> implications only for XGMI case because there will be contention 
>>>> over accessing those single instance variables from multiple devices
>>>> while now each device has its own copy.
>>
>> Since this is a rw semaphore that should be unproblematic I think. It 
>> could just be that the cache line of the lock then plays ping/pong 
>> between the CPU cores.
>>
>>>>
>>>> What benefit the centralization into reset_domain gives - is it for 
>>>> example to prevent one device in a hive trying to access through 
>>>> MMIO another one's
>>>> VRAM (shared FB memory) while the other one goes through reset ?
>>
>> I think that this is the killer argument for a centralized lock, yes.
>
>
> No problem, I will add a patch centralizing both flags into the reset domain 
> and resend.
>
> Andrey
>
>
>>
>> Christian.
>>
>>>>
>>>> Andrey 
>>



* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-26 15:52               ` Andrey Grodzovsky
@ 2022-01-28 16:57                 ` Grodzovsky, Andrey
  2022-02-07  2:41                   ` JingWen Chen
  0 siblings, 1 reply; 103+ messages in thread
From: Grodzovsky, Andrey @ 2022-01-28 16:57 UTC (permalink / raw)
  To: Christian König, Koenig, Christian, Lazar, Lijo, dri-devel,
	amd-gfx, Chen, JingWen
  Cc: Chen, Horace, Liu, Monk

[-- Attachment #1: Type: text/plain, Size: 2219 bytes --]

Just a gentle ping.

Andrey
________________________________
From: Grodzovsky, Andrey
Sent: 26 January 2022 10:52
To: Christian König <ckoenig.leichtzumerken@gmail.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs


JingWen - could you maybe give these patches a try on an SRIOV XGMI system? If you see issues, maybe you could let me connect and debug. My SRIOV XGMI system, which Shayun kindly arranged for me, is not loading the driver with my drm-misc-next branch even without my patches.

Andrey

On 2022-01-17 14:21, Andrey Grodzovsky wrote:


On 2022-01-17 2:17 p.m., Christian König wrote:
Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:

Ping on the question

Oh, my! That was already more than a week ago and is completely swapped out of my head again.


Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the reset_domain itself?

Offhand, that sounds like a good idea.


What then about the adev->reset_sem semaphore ? Should we also move this to reset_domain ? Both of the moves have functional
implications only for the XGMI case, because there will be contention over accessing those single-instance variables from multiple devices,
while now each device has its own copy.

Since this is a rw semaphore that should be unproblematic I think. It could just be that the cache line of the lock then plays ping/pong between the CPU cores.


What benefit the centralization into reset_domain gives - is it for example to prevent one device in a hive trying to access through MMIO another one's
VRAM (shared FB memory) while the other one goes through reset ?

I think that this is the killer argument for a centralized lock, yes.


No problem, I will add a patch centralizing both flags into the reset domain and resend.

Andrey


Christian.


Andrey


[-- Attachment #2: Type: text/html, Size: 3518 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-01-28 16:57                 ` Grodzovsky, Andrey
@ 2022-02-07  2:41                   ` JingWen Chen
  2022-02-07  3:08                     ` Grodzovsky, Andrey
  0 siblings, 1 reply; 103+ messages in thread
From: JingWen Chen @ 2022-02-07  2:41 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Christian König, Koenig,  Christian,
	Lazar, Lijo, dri-devel, amd-gfx, Chen, JingWen
  Cc: Chen, Horace, Liu, Monk

[-- Attachment #1: Type: text/plain, Size: 3543 bytes --]

Hi Andrey,

I don't have any XGMI machines here; maybe you can reach out to Shaoyun for help.

On 2022/1/29 12:57 AM, Grodzovsky, Andrey wrote:
> Just a gentle ping.
>
> Andrey
> ________________________________
> *From:* Grodzovsky, Andrey
> *Sent:* 26 January 2022 10:52
> *To:* Christian König <ckoenig.leichtzumerken@gmail.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Chen, JingWen <JingWen.Chen2@amd.com>
> *Cc:* Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
> *Subject:* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>  
>
> JingWen - could you maybe give those patches a try on an SRIOV XGMI system? If you see issues, maybe you could let me connect and debug. My SRIOV XGMI system, which Shaoyun kindly arranged, is not loading the driver with my drm-misc-next branch even without my patches.
>
> Andrey
>
> On 2022-01-17 14:21, Andrey Grodzovsky wrote:
>>
>>
>> On 2022-01-17 2:17 p.m., Christian König wrote:
>>> Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:
>>>>
>>>> Ping on the question
>>>>
>>>
>>> Oh, my! That was already more than a week ago and is completely swapped out of my head again.
>>>
>>>> Andrey
>>>>
>>>> On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
>>>>>>> Also, what about having the reset_active or in_reset flag in the reset_domain itself?
>>>>>>
>>>>>> Off hand that sounds like a good idea.
>>>>>
>>>>>
>>>>> What then about the adev->reset_sem semaphore? Should we also move this to reset_domain? Both of the moves have functional
>>>>> implications only for the XGMI case, because there will be contention over accessing those single-instance variables from multiple devices,
>>>>> while now each device has its own copy.
>>>
>>> Since this is a rw semaphore, that should be unproblematic I think. It could just be that the cache line of the lock then plays ping-pong between the CPU cores.
>>>
>>>>>
>>>>> What benefit does the centralization into reset_domain give? Is it, for example, to prevent one device in a hive from accessing through MMIO another one's
>>>>> VRAM (shared FB memory) while the other goes through reset?
>>>
>>> I think that this is the killer argument for a centralized lock, yes.
>>
>>
>> No problem, I will add a patch centralizing both the flag and the semaphore into the reset domain and resend.
>>
>> Andrey
>>
>>
>>>
>>> Christian.
>>>
>>>>>
>>>>> Andrey 
>>>

[-- Attachment #2: Type: text/html, Size: 6094 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  2022-02-07  2:41                   ` JingWen Chen
@ 2022-02-07  3:08                     ` Grodzovsky, Andrey
  0 siblings, 0 replies; 103+ messages in thread
From: Grodzovsky, Andrey @ 2022-02-07  3:08 UTC (permalink / raw)
  To: Chen, JingWen, Christian König, Koenig, Christian, Lazar,
	Lijo, dri-devel, amd-gfx, Liu, Shaoyun
  Cc: Chen, Horace, Liu, Monk

[-- Attachment #1: Type: text/plain, Size: 3549 bytes --]

I already did; thanks to Shaoyun I have already tested on XGMI SRIOV and it looks OK. What I need now is code review, mostly on the new patches (8-12). I hope you, Monk, Shaoyun, Lijo and Christian can help with that.

Andrey
________________________________
From: Chen, JingWen <JingWen.Chen2@amd.com>
Sent: 06 February 2022 21:41
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs


Hi Andrey,

I don't have any XGMI machines here; maybe you can reach out to Shaoyun for help.

On 2022/1/29 12:57 AM, Grodzovsky, Andrey wrote:
Just a gentle ping.

Andrey
________________________________
From: Grodzovsky, Andrey
Sent: 26 January 2022 10:52
To: Christian König <ckoenig.leichtzumerken@gmail.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: Chen, Horace <Horace.Chen@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs


JingWen - could you maybe give those patches a try on an SRIOV XGMI system? If you see issues, maybe you could let me connect and debug. My SRIOV XGMI system, which Shaoyun kindly arranged, is not loading the driver with my drm-misc-next branch even without my patches.

Andrey

On 2022-01-17 14:21, Andrey Grodzovsky wrote:


On 2022-01-17 2:17 p.m., Christian König wrote:
Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:

Ping on the question

Oh, my! That was already more than a week ago and is completely swapped out of my head again.


Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the reset_domain itself?

Off hand that sounds like a good idea.


What then about the adev->reset_sem semaphore? Should we also move this to reset_domain? Both of the moves have functional
implications only for the XGMI case, because there will be contention over accessing those single-instance variables from multiple devices,
while now each device has its own copy.

Since this is a rw semaphore, that should be unproblematic I think. It could just be that the cache line of the lock then plays ping-pong between the CPU cores.


What benefit does the centralization into reset_domain give? Is it, for example, to prevent one device in a hive from accessing through MMIO another one's
VRAM (shared FB memory) while the other goes through reset?

I think that this is the killer argument for a centralized lock, yes.


No problem, I will add a patch centralizing both the flag and the semaphore into the reset domain and resend.

Andrey


Christian.


Andrey


[-- Attachment #2: Type: text/html, Size: 5754 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2022-02-07  3:08 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-22 22:04 [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-22 22:04 ` Andrey Grodzovsky
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:04   ` Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:39   ` Christian König
2021-12-23  8:39     ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:40   ` Christian König
2021-12-23  8:40     ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:41   ` Christian König
2021-12-23  8:41     ` Christian König
2022-01-05  9:54   ` Lazar, Lijo
2022-01-05  9:54     ` Lazar, Lijo
2022-01-05 12:31     ` Christian König
2022-01-05 12:31       ` Christian König
2022-01-05 13:11       ` Lazar, Lijo
2022-01-05 13:11         ` Lazar, Lijo
2022-01-05 13:15         ` Christian König
2022-01-05 13:15           ` Christian König
2022-01-05 13:26           ` Lazar, Lijo
2022-01-05 13:26             ` Lazar, Lijo
2022-01-05 13:41             ` Christian König
2022-01-05 13:41               ` Christian König
2022-01-05 18:11       ` Andrey Grodzovsky
2022-01-05 18:11         ` Andrey Grodzovsky
2022-01-17 19:14         ` Andrey Grodzovsky
2022-01-17 19:17           ` Christian König
2022-01-17 19:21             ` Andrey Grodzovsky
2022-01-26 15:52               ` Andrey Grodzovsky
2022-01-28 16:57                 ` Grodzovsky, Andrey
2022-02-07  2:41                   ` JingWen Chen
2022-02-07  3:08                     ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13   ` Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13     ` Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:13     ` Andrey Grodzovsky
2021-12-22 22:14   ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-22 22:14     ` Andrey Grodzovsky
2021-12-23  8:42     ` Christian König
2021-12-23  8:42       ` Christian König
2021-12-23 10:14       ` Liu, Monk
2021-12-23 10:14         ` Liu, Monk
2021-12-24  8:58         ` Deng, Emily
2021-12-24  8:58           ` Deng, Emily
2021-12-24  9:57           ` JingWen Chen
2021-12-24  9:57             ` JingWen Chen
2021-12-30 18:45             ` Andrey Grodzovsky
2021-12-30 18:45               ` Andrey Grodzovsky
2022-01-03 10:17               ` Christian König
2022-01-03 10:17                 ` Christian König
2022-01-04  9:07                 ` JingWen Chen
2022-01-04  9:07                   ` JingWen Chen
2022-01-04 10:18                   ` Christian König
2022-01-04 10:18                     ` Christian König
2022-01-04 10:49                     ` Liu, Monk
2022-01-04 10:49                       ` Liu, Monk
2022-01-04 11:36                       ` Christian König
2022-01-04 11:36                         ` Christian König
2022-01-04 16:56                         ` Andrey Grodzovsky
2022-01-04 16:56                           ` Andrey Grodzovsky
2022-01-05  7:34                           ` JingWen Chen
2022-01-05  7:34                             ` JingWen Chen
2022-01-05  7:59                             ` Christian König
2022-01-05  7:59                               ` Christian König
2022-01-05 18:24                               ` Andrey Grodzovsky
2022-01-05 18:24                                 ` Andrey Grodzovsky
2022-01-06  4:59                                 ` JingWen Chen
2022-01-06  4:59                                   ` JingWen Chen
2022-01-06  5:18                                   ` JingWen Chen
2022-01-06  5:18                                     ` JingWen Chen
2022-01-06  9:13                                     ` Christian König
2022-01-06  9:13                                       ` Christian König
2022-01-06 19:13                                     ` Andrey Grodzovsky
2022-01-06 19:13                                       ` Andrey Grodzovsky
2022-01-07  3:57                                       ` JingWen Chen
2022-01-07  3:57                                         ` JingWen Chen
2022-01-07  5:46                                         ` JingWen Chen
2022-01-07  5:46                                           ` JingWen Chen
2022-01-07 16:02                                           ` Andrey Grodzovsky
2022-01-07 16:02                                             ` Andrey Grodzovsky
2022-01-12  6:28                                             ` JingWen Chen
2022-01-12  6:28                                               ` JingWen Chen
2022-01-04 17:13                         ` Liu, Shaoyun
2022-01-04 17:13                           ` Liu, Shaoyun
2022-01-04 20:54                           ` Andrey Grodzovsky
2022-01-04 20:54                             ` Andrey Grodzovsky
2022-01-05  0:01                             ` Liu, Shaoyun
2022-01-05  0:01                               ` Liu, Shaoyun
2022-01-05  7:25                         ` JingWen Chen
2022-01-05  7:25                           ` JingWen Chen
2021-12-30 18:39           ` Andrey Grodzovsky
2021-12-30 18:39             ` Andrey Grodzovsky
2021-12-23 18:07     ` Liu, Shaoyun
2021-12-23 18:07       ` Liu, Shaoyun
2021-12-23 18:29   ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-23 18:29     ` Andrey Grodzovsky
