* [PATCH 0/5] MISC fixes
@ 2021-07-16  1:34 Oak Zeng
  2021-07-16  1:34 ` [PATCH 1/5] drm/amdgpu: Fix a printing message Oak Zeng
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

Oak Zeng (5):
  drm/amdgpu: Fix a printing message
  drm/amdgpu: Change a few function names
  drm/amdkfd: Renaming dqm->packets to dqm->dpm
  drm/amdkfd: Set priv_queue to NULL after it is freed
  drm/amdkfd: Fix a concurrency issue during kfd recovery

 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c            | 16 +++++-----
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c              |  2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c              |  2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c              |  2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c              |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c            |  8 +----
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 36 +++++++++++++---------
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.h  |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c    |  4 +++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c          |  2 +-
 10 files changed, 40 insertions(+), 36 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* [PATCH 1/5] drm/amdgpu: Fix a printing message
  2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
@ 2021-07-16  1:34 ` Oak Zeng
  2021-07-16  7:20   ` Christian König
  2021-07-16  1:34 ` [PATCH 2/5] drm/amdgpu: Change a few function names Oak Zeng
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

The message "PSP loading VCN firmware" is misleading because people
might think the driver is loading VCN firmware at this point. Actually,
when this message is printed, the driver is only preparing the VCN
ucode; the actual VCN firmware loading happens later, in the PSP block's
hw_init. Fix the message to reflect this.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index 284bb42..121ee9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -119,7 +119,7 @@ static int vcn_v1_0_sw_init(void *handle)
 		adev->firmware.ucode[AMDGPU_UCODE_ID_VCN].fw = adev->vcn.fw;
 		adev->firmware.fw_size +=
 			ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
-		DRM_INFO("PSP loading VCN firmware\n");
+		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
 	}
 
 	r = amdgpu_vcn_resume(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
index 8af567c..f4686e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
@@ -122,7 +122,7 @@ static int vcn_v2_0_sw_init(void *handle)
 		adev->firmware.ucode[AMDGPU_UCODE_ID_VCN].fw = adev->vcn.fw;
 		adev->firmware.fw_size +=
 			ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
-		DRM_INFO("PSP loading VCN firmware\n");
+		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
 	}
 
 	r = amdgpu_vcn_resume(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index 888b17d..e0c0c37 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -152,7 +152,7 @@ static int vcn_v2_5_sw_init(void *handle)
 			adev->firmware.fw_size +=
 				ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
 		}
-		DRM_INFO("PSP loading VCN firmware\n");
+		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
 	}
 
 	r = amdgpu_vcn_resume(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index c3580de..a1bbe33 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -158,7 +158,7 @@ static int vcn_v3_0_sw_init(void *handle)
 			adev->firmware.fw_size +=
 				ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
 		}
-		DRM_INFO("PSP loading VCN firmware\n");
+		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
 	}
 
 	r = amdgpu_vcn_resume(adev);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/5] drm/amdgpu: Change a few function names
  2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
  2021-07-16  1:34 ` [PATCH 1/5] drm/amdgpu: Fix a printing message Oak Zeng
@ 2021-07-16  1:34 ` Oak Zeng
  2021-07-16  1:34 ` [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm Oak Zeng
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

The function name "psp_np_fw_load" is unclear because it is not obvious
that "_np_fw_" means "non-PSP firmware". Rename the function to
psp_load_non_psp_fw for better readability. Do the same for
psp_execute_np_fw_load, which becomes psp_execute_non_psp_fw_load.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index d9ddb2c..8d1e2b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2351,7 +2351,7 @@ static int psp_prep_load_ip_fw_cmd_buf(struct amdgpu_firmware_info *ucode,
 	return ret;
 }
 
-static int psp_execute_np_fw_load(struct psp_context *psp,
+static int psp_execute_non_psp_fw_load(struct psp_context *psp,
 			          struct amdgpu_firmware_info *ucode)
 {
 	int ret = 0;
@@ -2387,7 +2387,7 @@ static int psp_load_smu_fw(struct psp_context *psp)
 		}
 	}
 
-	ret = psp_execute_np_fw_load(psp, ucode);
+	ret = psp_execute_non_psp_fw_load(psp, ucode);
 
 	if (ret)
 		DRM_ERROR("PSP load smu failed!\n");
@@ -2442,14 +2442,14 @@ int psp_load_fw_list(struct psp_context *psp,
 	for (i = 0; i < ucode_count; ++i) {
 		ucode = ucode_list[i];
 		psp_print_fw_hdr(psp, ucode);
-		ret = psp_execute_np_fw_load(psp, ucode);
+		ret = psp_execute_non_psp_fw_load(psp, ucode);
 		if (ret)
 			return ret;
 	}
 	return ret;
 }
 
-static int psp_np_fw_load(struct psp_context *psp)
+static int psp_load_non_psp_fw(struct psp_context *psp)
 {
 	int i, ret;
 	struct amdgpu_firmware_info *ucode;
@@ -2488,7 +2488,7 @@ static int psp_np_fw_load(struct psp_context *psp)
 
 		psp_print_fw_hdr(psp, ucode);
 
-		ret = psp_execute_np_fw_load(psp, ucode);
+		ret = psp_execute_non_psp_fw_load(psp, ucode);
 		if (ret)
 			return ret;
 
@@ -2565,7 +2565,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	if (ret)
 		goto failed;
 
-	ret = psp_np_fw_load(psp);
+	ret = psp_load_non_psp_fw(psp);
 	if (ret)
 		goto failed;
 
@@ -2765,7 +2765,7 @@ static int psp_resume(void *handle)
 	if (ret)
 		goto failed;
 
-	ret = psp_np_fw_load(psp);
+	ret = psp_load_non_psp_fw(psp);
 	if (ret)
 		goto failed;
 
@@ -2863,7 +2863,7 @@ int psp_update_vcn_sram(struct amdgpu_device *adev, int inst_idx,
 	ucode.mc_addr = cmd_gpu_addr;
 	ucode.ucode_size = cmd_size;
 
-	return psp_execute_np_fw_load(&adev->psp, &ucode);
+	return psp_execute_non_psp_fw_load(&adev->psp, &ucode);
 }
 
 int psp_ring_cmd_submit(struct psp_context *psp,
-- 
2.7.4


* [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm
  2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
  2021-07-16  1:34 ` [PATCH 1/5] drm/amdgpu: Fix a printing message Oak Zeng
  2021-07-16  1:34 ` [PATCH 2/5] drm/amdgpu: Change a few function names Oak Zeng
@ 2021-07-16  1:34 ` Oak Zeng
  2021-07-16 14:05   ` Felix Kuehling
  2021-07-16  1:34 ` [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed Oak Zeng
  2021-07-16  1:34 ` [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery Oak Zeng
  4 siblings, 1 reply; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

Rename dqm->packets to dqm->dpm (device packet manager) to better
reflect what this member actually is.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c            |  2 +-
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 26 +++++++++++-----------
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.h  |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c          |  2 +-
 4 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9e4a05e..c51402b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1390,7 +1390,7 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
 		return -EINVAL;
 	}
 
-	r = pm_debugfs_hang_hws(&dev->dqm->packets);
+	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
 	if (!r)
 		r = dqm_debugfs_execute_queues(dev->dqm);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 16a1713..f2984d3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -260,7 +260,7 @@ static int allocate_vmid(struct device_queue_manager *dqm,
 static int flush_texture_cache_nocpsch(struct kfd_dev *kdev,
 				struct qcm_process_device *qpd)
 {
-	const struct packet_manager_funcs *pmf = qpd->dqm->packets.pmf;
+	const struct packet_manager_funcs *pmf = qpd->dqm->dpm.pmf;
 	int ret;
 
 	if (!qpd->ib_kaddr)
@@ -1000,7 +1000,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
 	init_interrupts(dqm);
 	
 	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
-		return pm_init(&dqm->packets, dqm);
+		return pm_init(&dqm->dpm, dqm);
 	dqm->sched_running = true;
 
 	return 0;
@@ -1009,7 +1009,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
 static int stop_nocpsch(struct device_queue_manager *dqm)
 {
 	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
-		pm_uninit(&dqm->packets, false);
+		pm_uninit(&dqm->dpm, false);
 	dqm->sched_running = false;
 
 	return 0;
@@ -1124,7 +1124,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
 			"queue mask: 0x%8llX\n",
 			res.vmid_mask, res.queue_mask);
 
-	return pm_send_set_resources(&dqm->packets, &res);
+	return pm_send_set_resources(&dqm->dpm, &res);
 }
 
 static int initialize_cpsch(struct device_queue_manager *dqm)
@@ -1164,7 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 	retval = 0;
 
-	retval = pm_init(&dqm->packets, dqm);
+	retval = pm_init(&dqm->dpm, dqm);
 	if (retval)
 		goto fail_packet_manager_init;
 
@@ -1197,7 +1197,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 	return 0;
 fail_allocate_vidmem:
 fail_set_sched_resources:
-	pm_uninit(&dqm->packets, false);
+	pm_uninit(&dqm->dpm, false);
 fail_packet_manager_init:
 	return retval;
 }
@@ -1213,10 +1213,10 @@ static int stop_cpsch(struct device_queue_manager *dqm)
 	dqm->sched_running = false;
 	dqm_unlock(dqm);
 
-	pm_release_ib(&dqm->packets);
+	pm_release_ib(&dqm->dpm);
 
 	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
-	pm_uninit(&dqm->packets, hanging);
+	pm_uninit(&dqm->dpm, hanging);
 
 	return 0;
 }
@@ -1390,7 +1390,7 @@ static int map_queues_cpsch(struct device_queue_manager *dqm)
 	if (dqm->active_runlist)
 		return 0;
 
-	retval = pm_send_runlist(&dqm->packets, &dqm->queues);
+	retval = pm_send_runlist(&dqm->dpm, &dqm->queues);
 	pr_debug("%s sent runlist\n", __func__);
 	if (retval) {
 		pr_err("failed to execute runlist\n");
@@ -1416,13 +1416,13 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	if (!dqm->active_runlist)
 		return retval;
 
-	retval = pm_send_unmap_queue(&dqm->packets, KFD_QUEUE_TYPE_COMPUTE,
+	retval = pm_send_unmap_queue(&dqm->dpm, KFD_QUEUE_TYPE_COMPUTE,
 			filter, filter_param, false, 0);
 	if (retval)
 		return retval;
 
 	*dqm->fence_addr = KFD_FENCE_INIT;
-	pm_send_query_status(&dqm->packets, dqm->fence_gpu_addr,
+	pm_send_query_status(&dqm->dpm, dqm->fence_gpu_addr,
 				KFD_FENCE_COMPLETED);
 	/* should be timed out */
 	retval = amdkfd_fence_wait_timeout(dqm->fence_addr, KFD_FENCE_COMPLETED,
@@ -1448,14 +1448,14 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	 * check those fields
 	 */
 	mqd_mgr = dqm->mqd_mgrs[KFD_MQD_TYPE_HIQ];
-	if (mqd_mgr->read_doorbell_id(dqm->packets.priv_queue->queue->mqd)) {
+	if (mqd_mgr->read_doorbell_id(dqm->dpm.priv_queue->queue->mqd)) {
 		pr_err("HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out\n");
 		while (halt_if_hws_hang)
 			schedule();
 		return -ETIME;
 	}
 
-	pm_release_ib(&dqm->packets);
+	pm_release_ib(&dqm->dpm);
 	dqm->active_runlist = false;
 
 	return retval;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 71e2fde..14479e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -169,7 +169,7 @@ struct device_queue_manager {
 	struct device_queue_manager_asic_ops asic_ops;
 
 	struct mqd_manager	*mqd_mgrs[KFD_MQD_TYPE_MAX];
-	struct packet_manager	packets;
+	struct packet_manager	dpm;
 	struct kfd_dev		*dev;
 	struct mutex		lock_hidden; /* use dqm_lock/unlock(dqm) */
 	struct list_head	queues;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index b1ce072..748e82b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1630,7 +1630,7 @@ int kfd_debugfs_rls_by_device(struct seq_file *m, void *data)
 		}
 
 		seq_printf(m, "Node %u, gpu_id %x:\n", i++, dev->gpu->id);
-		r = pm_debugfs_runlist(m, &dev->gpu->dqm->packets);
+		r = pm_debugfs_runlist(m, &dev->gpu->dqm->dpm);
 		if (r)
 			break;
 	}
-- 
2.7.4


* [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed
  2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
                   ` (2 preceding siblings ...)
  2021-07-16  1:34 ` [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm Oak Zeng
@ 2021-07-16  1:34 ` Oak Zeng
  2021-07-16 14:06   ` Felix Kuehling
  2021-07-16  1:34 ` [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery Oak Zeng
  4 siblings, 1 reply; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

This variable will be used in a future patch to determine whether
the packet manager has been initialized.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
index d8e940f..b130cc0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
@@ -278,6 +278,7 @@ void pm_uninit(struct packet_manager *pm, bool hanging)
 {
 	mutex_destroy(&pm->lock);
 	kernel_queue_uninit(pm->priv_queue, hanging);
+	pm->priv_queue = NULL;
 }
 
 int pm_send_set_resources(struct packet_manager *pm,
-- 
2.7.4


* [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery
  2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
                   ` (3 preceding siblings ...)
  2021-07-16  1:34 ` [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed Oak Zeng
@ 2021-07-16  1:34 ` Oak Zeng
  2021-07-16  3:27   ` Lazar, Lijo
  2021-07-16 14:09   ` Felix Kuehling
  4 siblings, 2 replies; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

start_cpsch and stop_cpsch can be called during kfd device
initialization or during gpu reset/recovery, so they can run
concurrently. Currently, the pm_init and pm_uninit calls in
start_cpsch and stop_cpsch are not protected by the dqm lock.
Consider the case where a user uses the packet manager to submit
a pm4 packet to hang the hws (e.g. through the command
cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee
/sys/kernel/debug/kfd/hang_hws) while the kfd device is under
device reset/recovery, so the packet manager may not be
initialized. There will be an unpredictable protection fault in
such a case.

This patch moves pm_init/pm_uninit inside the dqm lock and checks
that the packet manager is initialized before using packet manager
functions.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c               |  8 +-------
 drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++++++++--
 drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c       |  3 +++
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index c51402b..adc2342 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1383,18 +1383,12 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t throttle_bitmask)
  */
 int kfd_debugfs_hang_hws(struct kfd_dev *dev)
 {
-	int r = 0;
-
 	if (dev->dqm->sched_policy != KFD_SCHED_POLICY_HWS) {
 		pr_err("HWS is not enabled");
 		return -EINVAL;
 	}
 
-	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
-	if (!r)
-		r = dqm_debugfs_execute_queues(dev->dqm);
-
-	return r;
+	return dqm_debugfs_execute_queues(dev->dqm);
 }
 
 #endif
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index f2984d3..44136d2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1164,6 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 	retval = 0;
 
+	dqm_lock(dqm);
 	retval = pm_init(&dqm->dpm, dqm);
 	if (retval)
 		goto fail_packet_manager_init;
@@ -1186,7 +1187,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 	init_interrupts(dqm);
 
-	dqm_lock(dqm);
 	/* clear hang status when driver try to start the hw scheduler */
 	dqm->is_hws_hang = false;
 	dqm->is_resetting = false;
@@ -1199,6 +1199,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 fail_set_sched_resources:
 	pm_uninit(&dqm->dpm, false);
 fail_packet_manager_init:
+	dqm_unlock(dqm);
 	return retval;
 }
 
@@ -1211,12 +1212,12 @@ static int stop_cpsch(struct device_queue_manager *dqm)
 		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
 	hanging = dqm->is_hws_hang || dqm->is_resetting;
 	dqm->sched_running = false;
-	dqm_unlock(dqm);
 
 	pm_release_ib(&dqm->dpm);
 
 	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
 	pm_uninit(&dqm->dpm, hanging);
+	dqm_unlock(dqm);
 
 	return 0;
 }
@@ -2104,6 +2105,11 @@ int dqm_debugfs_execute_queues(struct device_queue_manager *dqm)
 	int r = 0;
 
 	dqm_lock(dqm);
+	r = pm_debugfs_hang_hws(&dqm->dpm);
+	if (r) {
+		dqm_unlock(dqm);
+		return r;
+	}
 	dqm->active_runlist = true;
 	r = execute_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
 	dqm_unlock(dqm);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
index b130cc0..0123e64 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
@@ -448,6 +448,9 @@ int pm_debugfs_hang_hws(struct packet_manager *pm)
 	uint32_t *buffer, size;
 	int r = 0;
 
+	if (!pm->priv_queue)
+		return -EBUSY;
+
 	size = pm->pmf->query_status_size;
 	mutex_lock(&pm->lock);
 	kq_acquire_packet_buffer(pm->priv_queue,
-- 
2.7.4


* Re: [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery
  2021-07-16  1:34 ` [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery Oak Zeng
@ 2021-07-16  3:27   ` Lazar, Lijo
  2021-07-16 14:09   ` Felix Kuehling
  1 sibling, 0 replies; 12+ messages in thread
From: Lazar, Lijo @ 2021-07-16  3:27 UTC (permalink / raw)
  To: Oak Zeng, amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, hawking.zhang



On 7/16/2021 7:04 AM, Oak Zeng wrote:
> start_cpsch and stop_cpsch can be called during kfd device
> initialization or during gpu reset/recovery, so they can run
> concurrently. Currently, the pm_init and pm_uninit calls in
> start_cpsch and stop_cpsch are not protected by the dqm lock.
> Consider the case where a user uses the packet manager to submit
> a pm4 packet to hang the hws (e.g. through the command
> cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee
> /sys/kernel/debug/kfd/hang_hws) while the kfd device is under
> device reset/recovery, so the packet manager may not be
> initialized. There will be an unpredictable protection fault in
> such a case.
> 
> This patch moves pm_init/pm_uninit inside the dqm lock and checks
> that the packet manager is initialized before using packet manager
> functions.
> 
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c               |  8 +-------
>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++++++++--
>   drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c       |  3 +++
>   3 files changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index c51402b..adc2342 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1383,18 +1383,12 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t throttle_bitmask)
>    */
>   int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>   {
> -	int r = 0;
> -
>   	if (dev->dqm->sched_policy != KFD_SCHED_POLICY_HWS) {
>   		pr_err("HWS is not enabled");
>   		return -EINVAL;
>   	}
>   
> -	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
> -	if (!r)
> -		r = dqm_debugfs_execute_queues(dev->dqm);
> -
> -	return r;
> +	return dqm_debugfs_execute_queues(dev->dqm);
>   }
>   
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index f2984d3..44136d2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1164,6 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   
>   	retval = 0;
>   
> +	dqm_lock(dqm);
>   	retval = pm_init(&dqm->dpm, dqm);
>   	if (retval)
>   		goto fail_packet_manager_init;
> @@ -1186,7 +1187,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   
>   	init_interrupts(dqm);
>   
> -	dqm_lock(dqm);
>   	/* clear hang status when driver try to start the hw scheduler */
>   	dqm->is_hws_hang = false;
>   	dqm->is_resetting = false;
> @@ -1199,6 +1199,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   fail_set_sched_resources:
>   	pm_uninit(&dqm->dpm, false);
>   fail_packet_manager_init:
> +	dqm_unlock(dqm);
>   	return retval;
>   }
>   
> @@ -1211,12 +1212,12 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>   		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   	hanging = dqm->is_hws_hang || dqm->is_resetting;
>   	dqm->sched_running = false;
> -	dqm_unlock(dqm);
>   
>   	pm_release_ib(&dqm->dpm);
>   
>   	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
>   	pm_uninit(&dqm->dpm, hanging);
> +	dqm_unlock(dqm);
>   
>   	return 0;
>   }
> @@ -2104,6 +2105,11 @@ int dqm_debugfs_execute_queues(struct device_queue_manager *dqm)
>   	int r = 0;
>   
>   	dqm_lock(dqm);
> +	r = pm_debugfs_hang_hws(&dqm->dpm);

execute_queues looks like a misnomer now, as it combines the logic
to hang the HWS.

> +	if (r) {
> +		dqm_unlock(dqm);
> +		return r;
> +	}
>   	dqm->active_runlist = true;
>   	r = execute_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   	dqm_unlock(dqm);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> index b130cc0..0123e64 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> @@ -448,6 +448,9 @@ int pm_debugfs_hang_hws(struct packet_manager *pm)
>   	uint32_t *buffer, size;
>   	int r = 0;
>   
> +	if (!pm->priv_queue)
> +		return -EBUSY;
> +

EAGAIN may be more appropriate

Thanks,
Lijo

>   	size = pm->pmf->query_status_size;
>   	mutex_lock(&pm->lock);
>   	kq_acquire_packet_buffer(pm->priv_queue,
> 

* Re: [PATCH 1/5] drm/amdgpu: Fix a printing message
  2021-07-16  1:34 ` [PATCH 1/5] drm/amdgpu: Fix a printing message Oak Zeng
@ 2021-07-16  7:20   ` Christian König
  0 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2021-07-16  7:20 UTC (permalink / raw)
  To: Oak Zeng, amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, hawking.zhang

Am 16.07.21 um 03:34 schrieb Oak Zeng:
> The message "PSP loading VCN firmware" is misleading because people
> might think the driver is loading VCN firmware at this point. Actually,
> when this message is printed, the driver is only preparing the VCN
> ucode; the actual VCN firmware loading happens later, in the PSP block's
> hw_init. Fix the message to reflect this.
>
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com> for patch #1 and #2.

Acked-by: Christian König <christian.koenig@amd.com> for the rest of the 
series.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 2 +-
>   4 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
> index 284bb42..121ee9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
> @@ -119,7 +119,7 @@ static int vcn_v1_0_sw_init(void *handle)
>   		adev->firmware.ucode[AMDGPU_UCODE_ID_VCN].fw = adev->vcn.fw;
>   		adev->firmware.fw_size +=
>   			ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
> -		DRM_INFO("PSP loading VCN firmware\n");
> +		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
>   	}
>   
>   	r = amdgpu_vcn_resume(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
> index 8af567c..f4686e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
> @@ -122,7 +122,7 @@ static int vcn_v2_0_sw_init(void *handle)
>   		adev->firmware.ucode[AMDGPU_UCODE_ID_VCN].fw = adev->vcn.fw;
>   		adev->firmware.fw_size +=
>   			ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
> -		DRM_INFO("PSP loading VCN firmware\n");
> +		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
>   	}
>   
>   	r = amdgpu_vcn_resume(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> index 888b17d..e0c0c37 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> @@ -152,7 +152,7 @@ static int vcn_v2_5_sw_init(void *handle)
>   			adev->firmware.fw_size +=
>   				ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
>   		}
> -		DRM_INFO("PSP loading VCN firmware\n");
> +		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
>   	}
>   
>   	r = amdgpu_vcn_resume(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index c3580de..a1bbe33 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -158,7 +158,7 @@ static int vcn_v3_0_sw_init(void *handle)
>   			adev->firmware.fw_size +=
>   				ALIGN(le32_to_cpu(hdr->ucode_size_bytes), PAGE_SIZE);
>   		}
> -		DRM_INFO("PSP loading VCN firmware\n");
> +		dev_info(adev->dev, "Will use PSP to load VCN firmware\n");
>   	}
>   
>   	r = amdgpu_vcn_resume(adev);


* Re: [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm
  2021-07-16  1:34 ` [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm Oak Zeng
@ 2021-07-16 14:05   ` Felix Kuehling
  0 siblings, 0 replies; 12+ messages in thread
From: Felix Kuehling @ 2021-07-16 14:05 UTC (permalink / raw)
  To: Oak Zeng, amd-gfx; +Cc: feifei.xu, leo.liu, hawking.zhang

Am 2021-07-15 um 9:34 p.m. schrieb Oak Zeng:
> Rename dqm->packets to dqm->dpm (device packet manager) to better
> reflect what this member actually is.

I don't think introducing another new acronym is helpful. Also "dpm" and
"dqm" are visually too similar. Other places use "pm" for packet
manager. If you feel "pm" is too short or too ambiguous, how about
"pmgr" or "packet_mgr"?

Regards,
  Felix


>
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c            |  2 +-
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 26 +++++++++++-----------
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.h  |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c          |  2 +-
>  4 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 9e4a05e..c51402b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1390,7 +1390,7 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>  		return -EINVAL;
>  	}
>  
> -	r = pm_debugfs_hang_hws(&dev->dqm->packets);
> +	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
>  	if (!r)
>  		r = dqm_debugfs_execute_queues(dev->dqm);
>  
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 16a1713..f2984d3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -260,7 +260,7 @@ static int allocate_vmid(struct device_queue_manager *dqm,
>  static int flush_texture_cache_nocpsch(struct kfd_dev *kdev,
>  				struct qcm_process_device *qpd)
>  {
> -	const struct packet_manager_funcs *pmf = qpd->dqm->packets.pmf;
> +	const struct packet_manager_funcs *pmf = qpd->dqm->dpm.pmf;
>  	int ret;
>  
>  	if (!qpd->ib_kaddr)
> @@ -1000,7 +1000,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
>  	init_interrupts(dqm);
>  	
>  	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
> -		return pm_init(&dqm->packets, dqm);
> +		return pm_init(&dqm->dpm, dqm);
>  	dqm->sched_running = true;
>  
>  	return 0;
> @@ -1009,7 +1009,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
>  static int stop_nocpsch(struct device_queue_manager *dqm)
>  {
>  	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
> -		pm_uninit(&dqm->packets, false);
> +		pm_uninit(&dqm->dpm, false);
>  	dqm->sched_running = false;
>  
>  	return 0;
> @@ -1124,7 +1124,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
>  			"queue mask: 0x%8llX\n",
>  			res.vmid_mask, res.queue_mask);
>  
> -	return pm_send_set_resources(&dqm->packets, &res);
> +	return pm_send_set_resources(&dqm->dpm, &res);
>  }
>  
>  static int initialize_cpsch(struct device_queue_manager *dqm)
> @@ -1164,7 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  
>  	retval = 0;
>  
> -	retval = pm_init(&dqm->packets, dqm);
> +	retval = pm_init(&dqm->dpm, dqm);
>  	if (retval)
>  		goto fail_packet_manager_init;
>  
> @@ -1197,7 +1197,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  	return 0;
>  fail_allocate_vidmem:
>  fail_set_sched_resources:
> -	pm_uninit(&dqm->packets, false);
> +	pm_uninit(&dqm->dpm, false);
>  fail_packet_manager_init:
>  	return retval;
>  }
> @@ -1213,10 +1213,10 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>  	dqm->sched_running = false;
>  	dqm_unlock(dqm);
>  
> -	pm_release_ib(&dqm->packets);
> +	pm_release_ib(&dqm->dpm);
>  
>  	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> -	pm_uninit(&dqm->packets, hanging);
> +	pm_uninit(&dqm->dpm, hanging);
>  
>  	return 0;
>  }
> @@ -1390,7 +1390,7 @@ static int map_queues_cpsch(struct device_queue_manager *dqm)
>  	if (dqm->active_runlist)
>  		return 0;
>  
> -	retval = pm_send_runlist(&dqm->packets, &dqm->queues);
> +	retval = pm_send_runlist(&dqm->dpm, &dqm->queues);
>  	pr_debug("%s sent runlist\n", __func__);
>  	if (retval) {
>  		pr_err("failed to execute runlist\n");
> @@ -1416,13 +1416,13 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  	if (!dqm->active_runlist)
>  		return retval;
>  
> -	retval = pm_send_unmap_queue(&dqm->packets, KFD_QUEUE_TYPE_COMPUTE,
> +	retval = pm_send_unmap_queue(&dqm->dpm, KFD_QUEUE_TYPE_COMPUTE,
>  			filter, filter_param, false, 0);
>  	if (retval)
>  		return retval;
>  
>  	*dqm->fence_addr = KFD_FENCE_INIT;
> -	pm_send_query_status(&dqm->packets, dqm->fence_gpu_addr,
> +	pm_send_query_status(&dqm->dpm, dqm->fence_gpu_addr,
>  				KFD_FENCE_COMPLETED);
>  	/* should be timed out */
>  	retval = amdkfd_fence_wait_timeout(dqm->fence_addr, KFD_FENCE_COMPLETED,
> @@ -1448,14 +1448,14 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  	 * check those fields
>  	 */
>  	mqd_mgr = dqm->mqd_mgrs[KFD_MQD_TYPE_HIQ];
> -	if (mqd_mgr->read_doorbell_id(dqm->packets.priv_queue->queue->mqd)) {
> +	if (mqd_mgr->read_doorbell_id(dqm->dpm.priv_queue->queue->mqd)) {
>  		pr_err("HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out\n");
>  		while (halt_if_hws_hang)
>  			schedule();
>  		return -ETIME;
>  	}
>  
> -	pm_release_ib(&dqm->packets);
> +	pm_release_ib(&dqm->dpm);
>  	dqm->active_runlist = false;
>  
>  	return retval;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index 71e2fde..14479e8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -169,7 +169,7 @@ struct device_queue_manager {
>  	struct device_queue_manager_asic_ops asic_ops;
>  
>  	struct mqd_manager	*mqd_mgrs[KFD_MQD_TYPE_MAX];
> -	struct packet_manager	packets;
> +	struct packet_manager	dpm;
>  	struct kfd_dev		*dev;
>  	struct mutex		lock_hidden; /* use dqm_lock/unlock(dqm) */
>  	struct list_head	queues;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index b1ce072..748e82b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -1630,7 +1630,7 @@ int kfd_debugfs_rls_by_device(struct seq_file *m, void *data)
>  		}
>  
>  		seq_printf(m, "Node %u, gpu_id %x:\n", i++, dev->gpu->id);
> -		r = pm_debugfs_runlist(m, &dev->gpu->dqm->packets);
> +		r = pm_debugfs_runlist(m, &dev->gpu->dqm->dpm);
>  		if (r)
>  			break;
>  	}
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed
  2021-07-16  1:34 ` [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed Oak Zeng
@ 2021-07-16 14:06   ` Felix Kuehling
  0 siblings, 0 replies; 12+ messages in thread
From: Felix Kuehling @ 2021-07-16 14:06 UTC (permalink / raw)
  To: Oak Zeng, amd-gfx; +Cc: feifei.xu, leo.liu, hawking.zhang

Am 2021-07-15 um 9:34 p.m. schrieb Oak Zeng:
> This variable will be used in a future patch to determine
> whether the packet manager is initialized.
>
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> index d8e940f..b130cc0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> @@ -278,6 +278,7 @@ void pm_uninit(struct packet_manager *pm, bool hanging)
>  {
>  	mutex_destroy(&pm->lock);
>  	kernel_queue_uninit(pm->priv_queue, hanging);
> +	pm->priv_queue = NULL;
>  }
>  
>  int pm_send_set_resources(struct packet_manager *pm,
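The NULL-after-free idea in the hunk above can be sketched outside the kernel. The types and function names below are illustrative stand-ins, not the actual kfd structures: clearing priv_queue after teardown leaves a testable sentinel, so later callers can detect the uninitialized state instead of dereferencing a dangling pointer.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kfd types; names are illustrative only. */
struct kernel_queue_sketch { int unused; };

struct packet_manager_sketch {
	pthread_mutex_t lock;
	struct kernel_queue_sketch *priv_queue;	/* NULL means "not initialized" */
};

/* Rough analogue of pm_init(): allocate the queue and record that we are live. */
int pm_init_sketch(struct packet_manager_sketch *pm)
{
	pm->priv_queue = malloc(sizeof(*pm->priv_queue));
	if (!pm->priv_queue)
		return -1;
	pthread_mutex_init(&pm->lock, NULL);
	return 0;
}

/* Rough analogue of pm_uninit(): free the queue, then clear the pointer.
 * The cleared pointer is the whole point of the patch above. */
void pm_uninit_sketch(struct packet_manager_sketch *pm)
{
	pthread_mutex_destroy(&pm->lock);
	free(pm->priv_queue);
	pm->priv_queue = NULL;	/* leave a sentinel for later callers */
}

/* Rough analogue of a user such as pm_debugfs_hang_hws(): bail out early
 * when the packet manager is not initialized. */
int pm_use_sketch(struct packet_manager_sketch *pm)
{
	if (!pm->priv_queue)
		return -16;	/* stands in for -EBUSY */
	/* ... a real user would build and submit a packet here ... */
	return 0;
}
```

A user that checks the sentinel never touches freed memory, which is exactly what a later patch in this series relies on.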

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery
  2021-07-16  1:34 ` [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery Oak Zeng
  2021-07-16  3:27   ` Lazar, Lijo
@ 2021-07-16 14:09   ` Felix Kuehling
  1 sibling, 0 replies; 12+ messages in thread
From: Felix Kuehling @ 2021-07-16 14:09 UTC (permalink / raw)
  To: Oak Zeng, amd-gfx; +Cc: feifei.xu, leo.liu, hawking.zhang


Am 2021-07-15 um 9:34 p.m. schrieb Oak Zeng:
> start_cpsch and stop_cpsch can be called during kfd device
> initialization or during gpu reset/recovery, so they can
> run concurrently. Currently, in start_cpsch and stop_cpsch,
> pm_init and pm_uninit are not protected by the dqm lock.
> Imagine a case where a user uses the packet manager's
> functions to submit a pm4 packet to hang hws (i.e. through
> the command cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id |
> sudo tee /sys/kernel/debug/kfd/hang_hws) while the kfd device
> is under reset/recovery, so the packet manager may not be
> initialized. There will be an unpredictable protection fault
> in such a case.
>
> This patch moves pm_init/uninit inside the dqm lock and checks
> that the packet manager is initialized before using packet
> manager functions.
>
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c               |  8 +-------
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++++++++--
>  drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c       |  3 +++
>  3 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index c51402b..adc2342 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1383,18 +1383,12 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t throttle_bitmask)
>   */
>  int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>  {
> -	int r = 0;
> -
>  	if (dev->dqm->sched_policy != KFD_SCHED_POLICY_HWS) {
>  		pr_err("HWS is not enabled");
>  		return -EINVAL;
>  	}
>  
> -	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
> -	if (!r)
> -		r = dqm_debugfs_execute_queues(dev->dqm);
> -
> -	return r;
> +	return dqm_debugfs_execute_queues(dev->dqm);

This function should now probably be renamed to something like
dqm_debugfs_hang_hws. Other than that, this patch looks good to me.

Regards,
  Felix


>  }
>  
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index f2984d3..44136d2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1164,6 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  
>  	retval = 0;
>  
> +	dqm_lock(dqm);
>  	retval = pm_init(&dqm->dpm, dqm);
>  	if (retval)
>  		goto fail_packet_manager_init;
> @@ -1186,7 +1187,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  
>  	init_interrupts(dqm);
>  
> -	dqm_lock(dqm);
>  	/* clear hang status when driver try to start the hw scheduler */
>  	dqm->is_hws_hang = false;
>  	dqm->is_resetting = false;
> @@ -1199,6 +1199,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  fail_set_sched_resources:
>  	pm_uninit(&dqm->dpm, false);
>  fail_packet_manager_init:
> +	dqm_unlock(dqm);
>  	return retval;
>  }
>  
> @@ -1211,12 +1212,12 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>  		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>  	hanging = dqm->is_hws_hang || dqm->is_resetting;
>  	dqm->sched_running = false;
> -	dqm_unlock(dqm);
>  
>  	pm_release_ib(&dqm->dpm);
>  
>  	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
>  	pm_uninit(&dqm->dpm, hanging);
> +	dqm_unlock(dqm);
>  
>  	return 0;
>  }
> @@ -2104,6 +2105,11 @@ int dqm_debugfs_execute_queues(struct device_queue_manager *dqm)
>  	int r = 0;
>  
>  	dqm_lock(dqm);
> +	r = pm_debugfs_hang_hws(&dqm->dpm);
> +	if (r) {
> +		dqm_unlock(dqm);
> +		return r;
> +	}
>  	dqm->active_runlist = true;
>  	r = execute_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>  	dqm_unlock(dqm);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> index b130cc0..0123e64 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> @@ -448,6 +448,9 @@ int pm_debugfs_hang_hws(struct packet_manager *pm)
>  	uint32_t *buffer, size;
>  	int r = 0;
>  
> +	if (!pm->priv_queue)
> +		return -EBUSY;
> +
>  	size = pm->pmf->query_status_size;
>  	mutex_lock(&pm->lock);
>  	kq_acquire_packet_buffer(pm->priv_queue,
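The locking change above can be reduced to a small user-space model. The names below are hypothetical, and a pthread mutex stands in for dqm_lock()/dqm_unlock(): the init and uninit transitions happen while holding the same lock their users take, and users test an "initialized" sentinel before touching the packet manager.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-in; not the real device_queue_manager. */
struct dqm_sketch {
	pthread_mutex_t lock;	/* plays the role of dqm_lock()/dqm_unlock() */
	bool pm_initialized;	/* plays the role of pm->priv_queue != NULL */
};

/* Before the patch, the init transition ran outside the lock, so a
 * concurrent debugfs user could observe a half-initialized packet
 * manager.  The fix: flip the state only while holding the lock. */
int start_cpsch_sketch(struct dqm_sketch *dqm)
{
	pthread_mutex_lock(&dqm->lock);
	dqm->pm_initialized = true;	/* pm_init() under the lock */
	pthread_mutex_unlock(&dqm->lock);
	return 0;
}

int stop_cpsch_sketch(struct dqm_sketch *dqm)
{
	pthread_mutex_lock(&dqm->lock);
	dqm->pm_initialized = false;	/* pm_uninit() under the lock */
	pthread_mutex_unlock(&dqm->lock);
	return 0;
}

/* A user such as dqm_debugfs_execute_queues(): take the lock, then check
 * the initialized state before using the packet manager. */
int debugfs_user_sketch(struct dqm_sketch *dqm)
{
	int r = 0;

	pthread_mutex_lock(&dqm->lock);
	if (!dqm->pm_initialized)
		r = -16;	/* stands in for -EBUSY */
	/* ... otherwise the real code would submit packets here ... */
	pthread_mutex_unlock(&dqm->lock);
	return r;
}
```

Because every reader and writer of the initialized state serializes on the one lock, the user can never race with init or teardown, which is the property the patch restores during gpu reset/recovery.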

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm
  2021-07-16  1:25 [PATCH 0/5] MISC fixes Oak Zeng
@ 2021-07-16  1:25 ` Oak Zeng
  0 siblings, 0 replies; 12+ messages in thread
From: Oak Zeng @ 2021-07-16  1:25 UTC (permalink / raw)
  To: amd-gfx; +Cc: feifei.xu, Felix.Kuehling, leo.liu, Oak Zeng, hawking.zhang

Rename dqm->packets to dqm->dpm (device packet manager) to
better reflect the real meaning of this variable.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c            |  2 +-
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 26 +++++++++++-----------
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.h  |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c          |  2 +-
 4 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9e4a05e..c51402b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1390,7 +1390,7 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
 		return -EINVAL;
 	}
 
-	r = pm_debugfs_hang_hws(&dev->dqm->packets);
+	r = pm_debugfs_hang_hws(&dev->dqm->dpm);
 	if (!r)
 		r = dqm_debugfs_execute_queues(dev->dqm);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 16a1713..f2984d3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -260,7 +260,7 @@ static int allocate_vmid(struct device_queue_manager *dqm,
 static int flush_texture_cache_nocpsch(struct kfd_dev *kdev,
 				struct qcm_process_device *qpd)
 {
-	const struct packet_manager_funcs *pmf = qpd->dqm->packets.pmf;
+	const struct packet_manager_funcs *pmf = qpd->dqm->dpm.pmf;
 	int ret;
 
 	if (!qpd->ib_kaddr)
@@ -1000,7 +1000,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
 	init_interrupts(dqm);
 	
 	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
-		return pm_init(&dqm->packets, dqm);
+		return pm_init(&dqm->dpm, dqm);
 	dqm->sched_running = true;
 
 	return 0;
@@ -1009,7 +1009,7 @@ static int start_nocpsch(struct device_queue_manager *dqm)
 static int stop_nocpsch(struct device_queue_manager *dqm)
 {
 	if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
-		pm_uninit(&dqm->packets, false);
+		pm_uninit(&dqm->dpm, false);
 	dqm->sched_running = false;
 
 	return 0;
@@ -1124,7 +1124,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
 			"queue mask: 0x%8llX\n",
 			res.vmid_mask, res.queue_mask);
 
-	return pm_send_set_resources(&dqm->packets, &res);
+	return pm_send_set_resources(&dqm->dpm, &res);
 }
 
 static int initialize_cpsch(struct device_queue_manager *dqm)
@@ -1164,7 +1164,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 	retval = 0;
 
-	retval = pm_init(&dqm->packets, dqm);
+	retval = pm_init(&dqm->dpm, dqm);
 	if (retval)
 		goto fail_packet_manager_init;
 
@@ -1197,7 +1197,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 	return 0;
 fail_allocate_vidmem:
 fail_set_sched_resources:
-	pm_uninit(&dqm->packets, false);
+	pm_uninit(&dqm->dpm, false);
 fail_packet_manager_init:
 	return retval;
 }
@@ -1213,10 +1213,10 @@ static int stop_cpsch(struct device_queue_manager *dqm)
 	dqm->sched_running = false;
 	dqm_unlock(dqm);
 
-	pm_release_ib(&dqm->packets);
+	pm_release_ib(&dqm->dpm);
 
 	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
-	pm_uninit(&dqm->packets, hanging);
+	pm_uninit(&dqm->dpm, hanging);
 
 	return 0;
 }
@@ -1390,7 +1390,7 @@ static int map_queues_cpsch(struct device_queue_manager *dqm)
 	if (dqm->active_runlist)
 		return 0;
 
-	retval = pm_send_runlist(&dqm->packets, &dqm->queues);
+	retval = pm_send_runlist(&dqm->dpm, &dqm->queues);
 	pr_debug("%s sent runlist\n", __func__);
 	if (retval) {
 		pr_err("failed to execute runlist\n");
@@ -1416,13 +1416,13 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	if (!dqm->active_runlist)
 		return retval;
 
-	retval = pm_send_unmap_queue(&dqm->packets, KFD_QUEUE_TYPE_COMPUTE,
+	retval = pm_send_unmap_queue(&dqm->dpm, KFD_QUEUE_TYPE_COMPUTE,
 			filter, filter_param, false, 0);
 	if (retval)
 		return retval;
 
 	*dqm->fence_addr = KFD_FENCE_INIT;
-	pm_send_query_status(&dqm->packets, dqm->fence_gpu_addr,
+	pm_send_query_status(&dqm->dpm, dqm->fence_gpu_addr,
 				KFD_FENCE_COMPLETED);
 	/* should be timed out */
 	retval = amdkfd_fence_wait_timeout(dqm->fence_addr, KFD_FENCE_COMPLETED,
@@ -1448,14 +1448,14 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	 * check those fields
 	 */
 	mqd_mgr = dqm->mqd_mgrs[KFD_MQD_TYPE_HIQ];
-	if (mqd_mgr->read_doorbell_id(dqm->packets.priv_queue->queue->mqd)) {
+	if (mqd_mgr->read_doorbell_id(dqm->dpm.priv_queue->queue->mqd)) {
 		pr_err("HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out\n");
 		while (halt_if_hws_hang)
 			schedule();
 		return -ETIME;
 	}
 
-	pm_release_ib(&dqm->packets);
+	pm_release_ib(&dqm->dpm);
 	dqm->active_runlist = false;
 
 	return retval;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 71e2fde..14479e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -169,7 +169,7 @@ struct device_queue_manager {
 	struct device_queue_manager_asic_ops asic_ops;
 
 	struct mqd_manager	*mqd_mgrs[KFD_MQD_TYPE_MAX];
-	struct packet_manager	packets;
+	struct packet_manager	dpm;
 	struct kfd_dev		*dev;
 	struct mutex		lock_hidden; /* use dqm_lock/unlock(dqm) */
 	struct list_head	queues;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index b1ce072..748e82b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1630,7 +1630,7 @@ int kfd_debugfs_rls_by_device(struct seq_file *m, void *data)
 		}
 
 		seq_printf(m, "Node %u, gpu_id %x:\n", i++, dev->gpu->id);
-		r = pm_debugfs_runlist(m, &dev->gpu->dqm->packets);
+		r = pm_debugfs_runlist(m, &dev->gpu->dqm->dpm);
 		if (r)
 			break;
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-07-16 14:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-16  1:34 [PATCH 0/5] MISC fixes Oak Zeng
2021-07-16  1:34 ` [PATCH 1/5] drm/amdgpu: Fix a printing message Oak Zeng
2021-07-16  7:20   ` Christian König
2021-07-16  1:34 ` [PATCH 2/5] drm/amdgpu: Change a few function names Oak Zeng
2021-07-16  1:34 ` [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm Oak Zeng
2021-07-16 14:05   ` Felix Kuehling
2021-07-16  1:34 ` [PATCH 4/5] drm/amdkfd: Set priv_queue to NULL after it is freed Oak Zeng
2021-07-16 14:06   ` Felix Kuehling
2021-07-16  1:34 ` [PATCH 5/5] drm/amdkfd: Fix a concurrency issue during kfd recovery Oak Zeng
2021-07-16  3:27   ` Lazar, Lijo
2021-07-16 14:09   ` Felix Kuehling
  -- strict thread matches above, loose matches on Subject: below --
2021-07-16  1:25 [PATCH 0/5] MISC fixes Oak Zeng
2021-07-16  1:25 ` [PATCH 3/5] drm/amdkfd: Renaming dqm->packets to dqm->dpm Oak Zeng
